Copyright © 2020 Denise Gosnell and Matthias Broecheler. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
See http://oreilly.com/catalog/errata.csp?isbn=9781492044079 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The Practitioner’s Guide to Graph Data, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and DataStax. See our statement of editorial independence.
978-1-492-04407-9
[LSI]
Think about the last time you searched for someone on a social media platform.
What did you look at on the results page?
Most likely, you started scanning down the names in the list of profile results. And you probably spent most of your time inspecting the “shared friends” section to understand how you knew someone.
Our innate human behavior of reasoning about our shared friends on social media is what inspired us to write this book. However, our shared inspiration generated two very different reasons for writing this book.
First, have you ever stopped to think about how an app creates the “shared friends” section?
Delivering your “shared friends” in search results requires an intricate orchestration of tools and data to solve an extremely complex, distributed problem. We have either built those sections or created the tools that deliver them. Our passion for understanding and teaching others from our collective experiences is the first reason we chose to write this book together.
The second reason is that anyone who uses social media intuitively derives personal context directly from the “shared friends” section. This process of reasoning and thinking about relationships within data is called graph thinking, and that is what we name the human approach to understanding life through connected data.
How did we all learn to do this?
There wasn’t a specific point in time when we all were taught this skill. Processing relationships among people, places, or things is just how we think.
It is the ease with which people infer context from relationships, be it in real life or from data, that has ignited the wave of graph thinking.
And when it comes to understanding graph thinking, most people fall into one of two camps: those who think graphs are about bar charts, or those who think graphs are way too complicated. Either way, these thought processes apply legacy approaches to thinking about data and technology. The problem is that the art of the possible has changed, our tools have improved, and there are new lessons to learn.
We believe that graphs are powerful and deployable. Graph technology can make you more productive; we have worked with teams that told us so.
This book brings these two mindsets together.
Graph thinking closes the gap between how we humans operate/see/live and how we use data to inform a decision. Imagine seeing your whole world as a spreadsheet with rows and columns of data and trying to make sense of it all. For the majority of us, the exercise is unnatural and counterproductive.
This is because relationships are how people navigate and reason about life. It is computers that need databases and operate in the world of rows and columns of data.
Graph thinking is a way to solve complex problems by taking a relationship-centric approach. Graph technology bridges the gap between “relationships” and the linear memory constraints of modern computer infrastructure.
As more people learn how to build with graph technology by applying graph thinking, imagine what the next wave of innovation will bring.
This book aims to teach you two things. First, we will teach you about graph thinking and the graph mindset through asking questions and reasoning about data. Second, we will walk you through writing code that solves the most common, complex graph problems.
These new concepts are intertwined within the tasks commonly performed across a few different engineering functions.
Data engineers and architects sit at the heart of transitioning an idea from development into production. We organized this book to show you how to resolve common assumptions that can occur when moving from development into production with graph data and graph tools. Another benefit to the data engineer or data architect will be learning the world of possibilities that come from understanding graph thinking. Synthesizing the breadth of problems that can be solved with graph data will also help you invent new patterns for their use in production applications.
Data scientists and data analysts may most benefit from reasoning about how to use graph data to answer interesting questions. All the examples throughout this text were constructed to apply a query-first approach to graph data. A secondary benefit for a data scientist or analyst will be to understand the complexity of using distributed graph data within a production application. We teach and build upon the common development pitfalls and their production resolution processes throughout the book so that you can formulate new types of problems to solve.
Computer scientists will learn how to use techniques in functional programming and distributed systems to query and reason about graph data. We will outline fundamental approaches to procedurally traversing graph data and step through their application with graph tools. Along the way we will learn about distributed technologies, too.
We will be working within the intersection of graph data and distributed, complex problems: a fascinating combination of engineering topics with something to learn for any technologist.
The first goal of this book is to create a new foundation that exists at a very diverse intersection. We will be working with concepts from graph theory, database schema, distributed systems, data analysis, and many other fields. This unique intersection forms what we refer to in this book as graph thinking. A new application domain requires new terms, examples, and techniques. This book serves as your foundation for understanding this emerging field.
From the past decade of graph technology emerged a common set of patterns for using graph data in production applications. The second goal of this book is to teach you those patterns. We define, illustrate, build, and implement the most popular ways teams use graph technology to solve complex problems. After studying this book, you will have a set of templates for building with graph technology to solve this common set of problems.
The third goal of this book is to transform how you think. Understanding and applying graph data to your problem introduces a paradigm shift into your thought processes. Through many upcoming examples, we aim to teach you the common ways that others think and reason about graph data within an application. This book teaches you what you need to know to apply graph thinking to a technology decision.
This book is organized roughly as follows:
Chapter 1 discusses graph thinking and provides detailed processes for its application to complex problems.
Chapters 2 and 3 introduce fundamental graph concepts that will be used throughout the rest of the book.
Chapters 4 and 5 apply graph thinking and distributed graph technology to building a Customer 360 banking application, the most popular use case for graph data today.
Chapters 6 and 7 delve into the world of hierarchical data and nested graph data through a telecommunications use case. Chapter 6 sets the stage for a common error that is resolved in Chapter 7.
Chapters 8 and 9 discuss pathfinding across graph data in detail, using an example of quantifying trust in social transaction networks via paths.
Chapters 10 and 12 teach you how to use collaborative filtering on graph data to design a Netflix-inspired recommendation system.
Chapter 11 can be thought of as a bonus chapter that illustrates how to apply entity resolution to the merging of multiple datasets into one large graph for collective analysis.
Each chapter pair (4 and 5, 6 and 7, 8 and 9, 10 and 12) follows the same structure. The first chapter in each pair introduces new concepts and a new example use case in a development environment. The second chapter delves into the details of production issues, such as performance and scalability, that need to be addressed for real-world deployments.
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This element signifies a tip or suggestion.
This element signifies a general note.
This element indicates a warning or caution.
Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/datastax/graph-book.
If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com.
You can also follow us on Twitter: https://twitter.com/Graph_Thinking
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “The Practitioner’s Guide to Graph Data by Denise Koessler Gosnell and Matthias Broecheler (O’Reilly). Copyright 2020 Denise Gosnell and Matthias Broecheler, 978-1-492-04407-9.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://www.oreilly.com/catalog/9781492044079.
Email bookquestions@oreilly.com to comment or ask technical questions about this book.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
We would like to thank the incredible group of people who donated their time and expertise to advising us and to reading and correcting this book.
We had the honor of working with a world-class editing team led by Jeff Bleiel. Our technical editing team of Alexey Ott, Lorina Poland, and Daniel Kuppitz applied their seasoned experience in creating, building, and writing about graph technologies. Their direct contributions elevated this book to a level that we could have reached only with their assistance. We are humbled that they went above and beyond to improve the quality and correctness of this text. Thank you.
We also would like to thank DataStax for its sponsorship and for encouraging our teams to collaborate on creating this book. We are very grateful for the support and review by the DataStax Graph Engineering team and for the product changes they made as we created our work together: Eduard Tudenhoefner, Dan LaRocque, Justin Chu, Rocco Varela, Ulises Cerviño Beresi, Stephen Mallette, and Jeremiah Jordan. We are especially grateful to Bryn Cooke, who coordinated and implemented a nontrivial amount of extra work to support the ideas in this book.
Many additional people transcended their obligations to make time to support us, as is the DataStax way. We would like to thank Dave Bechberger, Jonathan Lacefield, and Jonathan Ellis for their expert contributions and advocacy for this work. To Daniel Farrell, Jeremy Hanna, Kiyu Gabriel, Jeff Carpenter, Patrick McFadin, Peyton Casper, Matt Atwater, Paras Mehra, Kelly Mondor, and Jim Hatcher: our conversations throughout the creation of this work had more of an impact than you realize, so thank you.
All of the stories and examples throughout this text were inspired by our collaborations and experiences with colleagues around the world. To that end, we would like to recognize the graph heroes who spoke with us and helped shape this book’s narrative: Matt Aldridge, Christine Antonsen, David Boggess, Sean Brandt, Vamsi Duvvuri, Ilia Epifanov, Amy Hodler, Adam Judelson, Joe Koessler, Eric Koester, Alec Macrae, Patrick Planchamp, Gary Richardson, Kristin Stone, Samantha Tracht, Laurent Weichberger, and Brent Woosley. The time that we spent speaking with each of you and the information you shared made its way into the stories that we have the privilege of sharing in this text. Thank you for lending your voices, experiences, and ideas.
Denise would also like to extend her personal gratitude to those who mentored her throughout this journey. To Teresa Haynes and Debra Knisley: you ignited my passion for graph theory that continues to drive me every day; I wouldn’t have started this journey without you. To Mike Berry: you taught me how to get things done and to never stop reaching for my next big idea; thank you. To Ted Tanner: you opened a door and showed me what it means to build with passion and deliver with excellence; timing and execution are everything. To Mike Canzoneri: whether you know it or not, you were the boot that kicked me over the line to write this; thank you. And most importantly, to Ty, the unofficial “third author” who was with me every step of the way: thank you for your never-ending positivity.
Think about the first time you learned about graph technology.
The scene probably started at the whiteboard where your team of directors, architects, scientists, and engineers were discussing your data problems. Eventually, someone drew the connections from one piece of data to another. After stepping back, someone noted that the links across the data built up a graph.
That realization sparked the beginning of your team’s graph journey. The group saw that you could use relationships across the data to provide new and powerful insights to the business. An individual or a small group was probably tasked with evaluating the techniques and tools available for storing, analyzing, and/or retrieving graph-shaped data.
The next major revelation for your team was likely that it’s easy to explain your data as a graph. But it’s hard to use your data as a graph.
Sound familiar?
Much like this whiteboard experience, earlier teams discovered connections within their data and turned them into valuable applications we use every day. Think about apps like Netflix, LinkedIn, and GitHub. These products translate connected data into an integral asset used by millions of people around the world.
We wrote this book to teach you how they did it.
As both tool builders and tool users, we have had the opportunity to sit on both sides of the whiteboard conversation hundreds of times. From our experiences, we collected a core set of choices and subsequent technology decisions to accelerate your journey with graph technology.
This book will be your guide in navigating the space between understanding your data as a graph and using your data as a graph.
Graphs have been around for centuries. So why are they relevant now?
And before you skip this section, we ask you to hear us out. We are about to go into history here; it isn’t long, and it isn’t involved. We need to do this because the successes and failures of our recent history explain why graph technology is relevant again.
Graphs are relevant now because the tech industry’s focus has shifted over the last few decades. Previously, technologies and databases focused on how to most efficiently store data. Relational technologies evolved as the front-runner to achieve this efficiency. Now we want to know how we can get the most value out of data.
Today’s realization is that data is inherently more valuable when it is connected.
A little bit of historical context on the evolution of database technologies sheds a lot of light on how we got here, and maybe even on why you picked up this book. The history of database technology can loosely be divided into three eras: hierarchical, relational, and NoSQL. The following abbreviated tour explores each of these historical eras, with a focus on how each era is relevant to this book.
The following sections provide you with an abridged version of the evolution of graph technology. We are highlighting only the most relevant parts of our industry’s vast history. At the very least, we are saving you from losing your valuable time down the rabbit hole of a self-guided Wikipedia link walking tour—though ironically, the self-guided version would be walking through today’s most accessible knowledge graph.
This brief history will take us from the 1960s to today. Our tour will culminate with the fourth era of graph thinking that is on our doorstep, as shown in Figure 1-1. We are asking you to take this short journey with us because we believe that historical context is one of the keys to unlocking the wide adoption of graph technologies within our industry.
Technical literature interchangeably labels the database technologies of the 1960s through the 1980s as “hierarchical” or “navigational.” Irrespective of the label, the thinking during this era aimed to organize data in treelike structures.
During this era, database technologies stored data as records that were linked to one another. The architects of these systems envisioned walking through these treelike structures so that any record could be accessed by a key or system scan or through navigating the tree’s links.
In the early 1960s, the Database Task Group within CODASYL, the Conference/Committee on Data Systems Languages, organized to create the industry’s first set of standards. The Database Task Group created a standard for retrieving records from these tree structures. This early standard is known as “the CODASYL approach” and set the following three objectives for retrieving records from database management systems:1
Using a primary key
Scanning all the records in a sequential order
Navigating links from one record to another
CODASYL was a consortium formed in 1959 and was the group responsible for the creation and standardization of COBOL.
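To make these three retrieval patterns concrete, here is a minimal Python sketch of linked, treelike records. This is our own illustration with hypothetical record names, not CODASYL code or syntax:

```python
# A minimal sketch of the CODASYL-era retrieval patterns over linked,
# treelike records. The records and field names here are hypothetical.

# Each record has a primary key and a link ("parent") to another record.
records = {
    1: {"name": "engineering", "parent": None},
    2: {"name": "databases", "parent": 1},
    3: {"name": "graphs", "parent": 2},
}

# 1. Retrieval by primary key.
print(records[3]["name"])  # -> graphs

# 2. Sequential scan of all records.
for key, record in records.items():
    print(key, record["name"])

# 3. Navigating links from one record to another: walk the parent links.
key = 3
while key is not None:
    record = records[key]
    print(record["name"])
    key = record["parent"]  # follow the link to the next record
```

Note the asymmetry the rest of this chapter traces: the first two patterns map onto decades of successful tooling, while the third, navigating links, is the one the industry shelved for years.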
Aside from the history lesson, there is an ironic point we are building up to. At the inception of this approach, the technologists of CODASYL envisioned retrieving data by keys, scans, and links. To date, we have seen significant innovation in and adoption of two of these three original standards: keys and scans.
But what happened with the third goal of CODASYL’s retrieval standardization: to navigate links from one record to another? Storing, navigating, and retrieving records according to the links between them describes what we refer to today as graph technology. And as we mentioned before, graphs are not new; technologists have been using them for years.
The short version of this part of our history is that CODASYL’s link-navigating technologies were too difficult and too slow. The most innovative solutions at the time introduced B-trees, or self-balancing tree data structures, as a structural optimization to address performance issues. In this context, B-trees helped speed up record retrieval by providing alternate access paths across the linked records.2
Ultimately, the imbalance among implementation expenditures, hardware maturity, and delivered value resulted in these systems being shelved for their speedier cousin: relational systems. As a result, CODASYL no longer exists today, though some of the CODASYL committees continue their work.
Edgar F. Codd’s idea to separate the organization of data from its retrieval system ignited the next wave of innovation in data management technologies.3 Codd’s work founded what we still refer to as the entity-relationship era of databases.
The entity-relationship era encompasses the decades when our industry polished the approach for retrieving data by a key, which was one of the objectives set by the early working groups of the 1960s. During this era, our industry developed technology that was, and still is, extremely efficient at storing, managing, and retrieving data from tables. The techniques developed during these decades are still thriving today because they are tested, documented, and well understood.
The systems of this era introduced and popularized a specific way of thinking about data. First and foremost, relational systems are built on the sound mathematical theory of relational algebra. Specifically, relational systems organize your data into sets. These sets focus on the storage and retrieval of real-world entities, such as people, places, and things. Similar entities, such as people, are grouped together in a table. In these tables, each record is a row. An individual record is accessed from the table by its primary key.
In relational systems, entities can be linked together. To create links between entities, you create more tables. A linking table will combine the primary keys of each entity and store them as a new row in the linking table. This era, and the innovators within it, created the solution for tabular-shaped data that still thrives today.
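To make the linking-table pattern concrete, here is a hedged sketch using Python’s built-in sqlite3 module; the schema and all table and column names are hypothetical:

```python
# A minimal sketch of a relational linking table. The schema below is
# invented for illustration, using only Python's standard library.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE places (place_id INTEGER PRIMARY KEY, city TEXT);
    -- The linking table: one row per relationship, combining the
    -- primary keys of the two linked entities.
    CREATE TABLE visited (
        person_id INTEGER REFERENCES people(person_id),
        place_id  INTEGER REFERENCES places(place_id),
        PRIMARY KEY (person_id, place_id)
    );
""")
conn.execute("INSERT INTO people VALUES (1, 'Alice')")
conn.execute("INSERT INTO places VALUES (10, 'Sebastopol')")
conn.execute("INSERT INTO visited VALUES (1, 10)")

# Following a relationship means joining through the linking table.
rows = conn.execute("""
    SELECT people.name, places.city
    FROM people
    JOIN visited ON visited.person_id = people.person_id
    JOIN places ON places.place_id = visited.place_id
""").fetchall()
print(rows)  # [('Alice', 'Sebastopol')]
```

Notice that following even one relationship requires a join through the linking table; every further hop across relationships adds another join.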
There are volumes of books and more resources than one can mention on the topic of relational systems. This book does not intend to be one of them. Instead, we want to focus on the thought processes and design principles that have become widely accepted today.
For better or for worse, this era introduced and ingrained the mentality that all data maps to a table.
If your data needs to be organized in and retrieved from a table, relational technologies remain the preferred solution. But however integral their role remains, relational technologies are not a one-size-fits-all solution.
The late ’90s brought early signs of the information age through the popularization of the web. This stage during our short history hinted at volumes and shapes of data that were previously unplanned and unused. At this time in database innovation, incomprehensible volumes of data in diverse shapes began to fill the queues of applications. A key realization at this point was that the relational model was lacking: there was no mention of the intended use for the data. The industry had a detailed storage model, but nothing for analyzing or intelligently applying that data.
This brings us to the third and most recent wave of database innovation.
The development of database technologies from the 2000s to the 2020s, approximately, is characterized as the advent of the NoSQL (non-SQL or “not only SQL”) movement. The objective of this era was to create scalable technologies that stored, managed, and queried all shapes of data.
One way to describe the NoSQL era relates database innovation to the burgeoning of the craft beer market in the United States. The process of fermenting the beer didn’t change, but flavors were added and the quality and freshness of ingredients were elevated. A closer connection developed between the brew master and the consumer, yielding an immediate feedback loop on product direction. Now, instead of three brands of beer in your supermarket, you likely have more than 30.
Instead of finding new combinations for fermentation, the database industry experienced exponential growth in choices for data management technologies. Architects needed scalable technologies to address the different shapes, volumes, and requirements of their rapidly growing applications. Popular data shapes that emerged during this movement were key-value, wide-column, document, stream, and graph.
The message of the NoSQL era was quite clear: storing, managing, and querying data at scale in tables doesn’t work for everything, just like not everyone wants to drink a light pilsner.
There were a few motivations that led to the NoSQL movement. These motivations are integral to understanding why and where we are within the hype cycle of the graph technology market. The three we like to call out are the need for data serialization standards, specialized tooling, and horizontal scalability.
First, the rise in popularity of web-based applications created natural channels for passing data between these applications. Through these channels, innovators developed new and different standards for data serialization such as XML, JSON, and YAML.
Naturally, these standardizations led to the second motivation: specialized tooling. The protocols for exchanging data across the web created structures that were inherently not tabular. This demand led to the innovation and rise in popularity of key-value, document, graph, and other specialized databases.
Last, this new class of applications came with an influx of data that put pressure on system scalability like never before. Derivatives and applications of Moore’s law predicted the silver lining of this era as we saw the cost of hardware, and thus the cost of data storage, continue to decrease. The effects of Moore’s law enabled data duplication, specialized systems, and overall computation power to become less expensive.4
Together, the innovations and new demands of the NoSQL era paved the way for the industry’s migration from scale-up systems to scale-out systems. A scale-out system adds physical or virtual machines to increase the overall computational capacity of a system. A scale-out system, generally referred to as a “cluster,” appears to the end-user as a single platform; the user has no idea that their workload is actually being served by a collection of servers. On the other hand, a scale-up system procures more powerful machines. Out of room? Get a bigger box, which is more expensive, until there are no bigger boxes to get.
Scaling out means adding more resources to spread out a load, typically in parallel. Scaling up means making a resource bigger or faster so that it can handle more load.
Given these three motivations, this versatile tool set for building scalable data architectures for nontabular data evolved to be the most important deliverable of the NoSQL era. Now development teams have choices to evaluate when designing their next application. They can select from a suite of technologies to accommodate different shapes, velocities, and scalability requirements of their data. There are tools that manage, store, search, and retrieve document, key-value, wide-column, and/or graph data at any scale. With these tools, we began working with multiple forms of data in ways previously unachievable.
What can we do with this unique collection of tools and data? We can solve more complex problems faster and at a larger scale.
We promised you that our history tour would be brief and purposeful. This section delivers on that promise by connecting the important moments from our condensed tour. Together, the connections we see across our industry’s history set the stage for the fourth era of database innovation: the wave of graph thinking.
This era in innovation is shifting from efficiency of the storage systems to extracting value from the data the storage systems contain.
Before we can outline our perspective on the graph era, you might be wondering why we are starting the era of graph thinking in 2020. We want to take a brief moment to explain our position on the timing of the graph market.
Our callout to the general timeline of 2020 comes from the intersection of two trains of thought. At this intersection, we are crossing Geoffrey Moore’s popular adoption model5 with the timing observed during the past three eras of database innovation.
Like CODASYL, the technology adoption life cycle commonly attributed to Moore originated in the 1950s. See Everett Rogers’s 1962 book Diffusion of Innovations.6
Specifically, there is a proven and observable time lag between early adopters and the wide adoption of new technologies. We saw this time lag in “1980s–2000s: Entity-Relationship” with relational databases during the 1970s. There was a 10-year lag between the first paper and corresponding viable implementations of relational technology. You can find examples of the same time lag within each of the other eras.
History has shown us that every era prior to the graph era contained a niche period that saw wide adoption years later. By looking to the 2020s, we are making this same assumption about the state of the graph market. History has also shown us that this doesn’t mean that the existing tools are going to go away.
However you would like to measure it, this is not a stock market prediction where we are nailing down a date. Our outlook ultimately describes a new era of technology adoption that is being driven by an evolution of value. That is, value is shifting from efficiency to being derived from highly connected data assets. These changes take time and do not run on schedules.
Recall the three patterns of retrieval envisioned by the CODASYL committee in the 1960s: accessing data by keys, scans, and links. Extracting a piece of data by its key, in any shape, remains the most efficient way to access it. This efficiency was achieved during the entity-relationship era and remains a popular solution.
As for the second goal of the CODASYL committee, accessing data through scans, the NoSQL era created technologies capable of handling large scans of data. Now we have software and hardware capable of processing and extracting value from massive datasets at immense scale. That is to say: we have the committee’s first two goals nailed down.
Last on the list: accessing data by traversing links. Our industry has come full circle.
The industry’s return to focusing on graph technologies goes hand in hand with our shift from efficiently managing data to needing to extract value from it. This shift doesn’t mean we no longer need to efficiently manage data; it means we have solved one problem well and are moving on to address the harder problem. Our industry now emphasizes value alongside speed and cost.
Extracting value from data can be achieved when you are able to connect pieces of information and construct new insights. That value comes from understanding the complex network of relationships within your data.
This is synonymous with recognizing the complex problems and complex systems that are observable across the inherent network in your data.
Our industry’s and this book’s focus looks toward developing and deploying technologies that deliver value from data. As in the relational era, a new way of thinking is required to understand, deploy, and apply these technologies.
A shift in mindset needs to occur in order to see the value we are talking about here. This mindset is a shift from thinking about your data in a table to prioritizing the relationships across it. This is what we call graph thinking.
Without explicitly stating it, we already walked through what we call graph thinking during the whiteboard scene at the beginning of this chapter.
When we illustrated the realization that your data could look like a graph, we were recreating the power of graph thinking. It is that simple: graph thinking encompasses your experience and realizations when you see the value of understanding relationships across your data.
Graph thinking is understanding a problem domain as an interconnected graph and using graph techniques to describe domain dynamics in an effort to solve domain problems.
Being able to see graphs across your data is the same as recognizing the complex network within your domain. Within a complex network, you will find the most complex problems to solve. And most high-value business problems and opportunities are complex problems.
This is why the next stage of innovation in data technologies is shifting from a focus on efficiency to a focus on finding value by specifically applying graph technologies.
We have used the term complex problem a few times now without providing a specific description. When we talk about complex problems, we are referring to the networks within complex systems.
Complex problems are the individual problems that are observable and measurable within complex systems.
Complex systems are systems composed of many individual components that are interconnected in various ways such that the behavior of the overall system is not just a simple aggregate of the individual components’ behavior (called “emergent behavior”).
Complex systems describe the relationships, influences, dependencies, and interactions among the individual components of real-world constructs. Simply put, a complex system describes anything where multiple components interact with each other. Examples of complex systems are human knowledge, supply chains, transportation or communication systems, social organization, earth’s global climate, and the entire universe.
Most high-value business problems are complex problems and require graph thinking. This book will teach you the four main patterns—neighborhoods, hierarchies, paths, and recommendations—used to solve complex problems with graph technology for businesses around the world.
Data is no longer just a by-product of doing business. Data is increasingly becoming a strategic asset in our economy. Previously, data was something that needed to be managed with the greatest convenience and the least cost to enable business operation. Now it is treated as an investment that should yield a return. This requires us to rethink how we handle and work with data.
For example, the late stage of the NoSQL era saw the acquisitions of LinkedIn and GitHub by Microsoft. These acquisitions gave measurement to the value of data that solves complex problems. Specifically, Microsoft acquired LinkedIn for $26 billion on an estimated $1 billion in revenue. GitHub’s acquisition set the price at $7.8 billion on an estimated $300 million in revenue.
Each of these companies, LinkedIn and GitHub, owns the graph of its respective network: the professional graph and the developer graph, respectively. Both purchase prices work out to roughly a 26× revenue multiple ($26 billion on $1 billion for LinkedIn; $7.8 billion on $300 million for GitHub) for the data that models a domain’s complex system. These two acquisitions begin to illustrate the strategic value of data that models a domain’s graph. Owning a domain’s graph yields significant return on a company’s valuation.
We do not want to misrepresent our intentions with these statistics. Observing high revenue multiples for fast-growing startups isn’t a novelty. We specifically mention these two examples because GitHub and LinkedIn found and monetized value from data. These revenue multiples are higher than the valuations of similarly sized and similarly growing startups because of the data asset.
By applying graph thinking, these companies are able to represent, access, and understand the most complex problem within their domain. In short, these companies built solutions for some of the largest and most difficult complex systems.
Companies that have a head start on rethinking data strategies are those that built technology to model their domains’ most complex problems. Specifically, what do Google, Amazon, FedEx, Verizon, Netflix, and Facebook all have in common? Aside from being among today’s most valued companies, each one owns the data that models its domain’s largest and most complex problem. Each owns the data that constructs its domain’s graph.
Just think about it. Google has the graph of all human knowledge. Amazon and FedEx contain the graphs of our global supply chain and transportation economies. Verizon’s data builds up our world’s largest telecommunications graph. Facebook has the graph of our global social network. Netflix has access to the entertainment graph, modeled in Figure 1-2 and implemented in the final chapters of this book.
Going forward, those companies that invest in data architectures to model their domains’ complex systems will join the ranks of these behemoths. The investment in technologies for modeling complex systems is the same as prioritizing the extraction of value from data.
If you want to get value out of your data, the first place to look is within its interconnectivity. What you are looking for is the complex system that your data describes. From there, your next decisions center around the right technologies for storing, managing, and extracting this interconnectivity.
Whether or not you work at one of the companies previously mentioned, you can learn to apply graph thinking to the data in your domain.
So where do you get started?
The difficulty with learning and applying graph thinking begins with recognizing where relationships do or do not add value within your data. We use the two images in this section to simplify the stops along the way and illustrate the challenges ahead.
Though simple, Figure 1-3 challenges you to evaluate pivotal questions about your data. This first decision requires your team to know the type of data your application requires. We specifically start with this question because it is often overlooked.
Other teams before yours have overlooked the choices shown in Figure 1-3 because the lure of the new distracted them from following established processes for building production applications. This strain between new and established caused early teams to move too quickly through a critical evaluation of their application’s goals. Because of this, we saw many graph projects fail and be shelved.
Let’s step through what we mean in Figure 1-3 to keep you from repeating the common mistakes of early adopters of graph technologies.
There are many ways of thinking about data. This first question in the decision tree challenges you to understand the shape of data that your application requires. For example, the mutual connections section on LinkedIn is a great example of a “yes” answer to question 1 in Figure 1-3. LinkedIn uses relationships between contacts so you can navigate your professional network and understand your shared connections. Presenting a section of mutual connections to an end user is a very popular use of graph-shaped data; Twitter, Facebook, and other social networking applications do the same.
When we say “shape of data,” we are referring to the structure of the valuable information you want to get out of your data. Do you want to know the name and age of a person? We would describe that as a row of data that would fit into a table. Do you want to know the chapter, section, page, and example in this book that shows you how to add a vertex to a graph? We would describe that as nested data that would fit into a document or hierarchy. Do you want to know the series of friends of your friends that connect you to Elon Musk? Here you are asking for a series of relationships that best fit into a graph.
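These three shapes are easy to see side by side in code. Here is a hedged sketch in Python with made-up values, ending with a small friends-of-friends path search of the kind described above:

```python
# A sketch of the three data shapes described above; all values are made up.
from collections import deque

# Tabular shape: a name and an age, one row that fits in a table.
row = {"name": "Alice", "age": 34}

# Nested shape: a chapter/section/page/example hierarchy, one document.
document = {"chapter": 9, "section": 2, "page": 214, "example": "add a vertex"}

# Graph shape: who is friends with whom.
friends = {
    "you": ["ana", "bob"],
    "ana": ["carlos"],
    "bob": ["dana"],
    "carlos": ["elon"],
}

def find_path(graph, start, goal):
    """Breadth-first search for the series of friends linking start to goal."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        for friend in graph.get(path[-1], []):
            if friend == goal:
                return path + [friend]
            if friend not in seen:
                seen.add(friend)
                queue.append(path + [friend])
    return None

print(find_path(friends, "you", "elon"))  # ['you', 'ana', 'carlos', 'elon']
```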
Thinking top-down, we advise that the shape of your data drive the decision about your database and technology options. The types of data commonly used in modern applications are shown in Table 1-1.
| Data description | Data shape | Usage | Database recommendation |
|---|---|---|---|
| Spreadsheets or tables | Relational | Retrieved by a primary key | RDBMS databases |
| Collections of files or documents | Hierarchical or nested | Root identified by an ID | Document databases |
| Relationships or links | Graph | Queried by a pattern | Graph databases |
For the most interesting data problems today, you need to be able to apply all three ways of thinking about your data. You need to be fluent in applying each to your data problem and its subproblems. For each piece of your problem, you need to understand the shape of the data coming into, residing within, and leaving your application. Each of these points, and any time in which data is in flight, drives the requirements for technology choices in your application.
If you are unsure about the shape of data that your problem requires, the next question from Figure 1-3 challenges you to think about the importance of relationships within your data.
The more pivotal question from Figure 1-3 asks whether relationships within your data exist and bring value to your business problem. A successful use of graph technology hinges on applying this second question from the decision tree. To us, there are only three answers to this question: yes, no, or maybe.
If you can confidently answer yes or no, then the path is clear. For example, LinkedIn’s mutual connection section exemplifies a clear “yes” for graph-shaped data whereas LinkedIn’s search box requires faceted search functionality and is a clear “no.” We can make these clear distinctions by understanding the shape of data required to solve the business problem.
If relationships within your data help solve your business problem, then you need to use and apply graph technologies within your application. If they do not, then you need to find a different tool. Maybe a choice from Table 1-1 will be a solution for your problem at hand.
当您不能完全确定关系对您的业务问题是否重要时,棘手的部分就会出现。这在图 1-3左侧的“可能?”选项中显示。根据我们的经验,如果您的思路将您带到了这个决策点,那么您可能正在尝试解决太大的问题。我们建议您分解问题并从图 1-3的顶部开始。我们建议团队分解的最常见问题是实体解析,或者知道数据中谁是谁。第 11 章详细介绍了何时在实体解析中使用图结构的示例。
The tricky part comes into play when you aren’t exactly sure whether relationships are important to your business problem. This is shown with the “Maybe?” choice at left in Figure 1-3. In our experience, if your line of thinking brings you to this decision point, then you are likely trying to solve too large of a problem. We advise that you break down your problem and start back at the top of Figure 1-3. The most common problem we advise teams to break down is entity resolution, or knowing who-is-who in your data. Chapter 11 details an example of when to use graph structure within entity resolution.
有时,将数据形状视为图可以包含其他两种数据形状的重要性:嵌套和表格。团队常常会误解这个转移注意力的话题。
Sometimes, seeing the shape of your data as a graph can subsume the importance of the other two data shapes: nested and tabular. Teams commonly misinterpret this red herring.
虽然您可能将问题视为一个复杂问题,并因此采用图思维来理解它,但这并不意味着您必须将图技术应用于问题的所有数据组件。事实上,将某些组件或子问题投影到表格或嵌套文档上可能会更有优势。
While you may think about your problem as a complex problem and therefore employ graph thinking to make sense of it, that does not mean you have to apply graph technologies to all data components of your problem. In fact, it may be advantageous to project certain components or subproblems onto tables or nested documents.
以投影(文件或表格)的方式思考总是有用的。因此,图 1-3中的思考练习不仅仅是“思考数据的最佳方式是什么?”,而是深入研究更敏捷的思维过程,将复杂问题分解为更小的部分。也就是说,我们鼓励您考虑针对当前手头的问题思考数据的最佳方式。
It will always be useful to think in projections (to files or tables). So our thought exercise in Figure 1-3 is more than “Which is the best way to think about your data?” It is above delving into a more agile thought process to break down complex problems into smaller components. That is, we encourage you to consider the best way to think about your data for the current problem at hand.
The shortest version of what we are trying to say in Figure 1-3 is: use the right tool for the problem at hand. And when we say "tool" here, we are thinking very broadly. We aren't necessarily using that term to refer to the choice of databases; we are thinking more broadly about the scope of data representation choices.
The first question from Figure 1-3 challenges you to apply query-driven design to your data representation decisions. There may be parts of your complex problem that are best represented with tables or nested documents. That is expected.
But what happens when you have graph data and need to use it? This brings us to the second part of our graph thinking thought process, shown in Figure 1-4.
Moving forward, we are assuming that your application benefits from understanding, modeling, and using the relationships within your data.
Within the world of graph technologies, there are two main things you will need to do with your graph data: analyze it or query it. Continuing the LinkedIn example, the mutual connections section is an example of when graph data is queried and loaded into view. LinkedIn's research team probably tracks the average number of connections between any two people, which is an example of analyzing graph data.
The answer to this third question divides graph technology decisions into two camps: data analysis versus data management. The center of Figure 1-4 shows this question and the decision flow for each option.
When we say analyze, we are referring to when you need to examine your data. Usually, teams spend time studying the relationships within their data with the goal of finding which relationships are important. This process is different from querying your graph data. Query refers to when you need to retrieve data from a system. In this case, you know the question you need to ask and the relationships required to answer the question.
Let's start with the option that moves to the right: the cases when you know your end application needs to store and query the relationships within your data. Admittedly, this is the least likely path today, due to the stage and age of the graph industry. But in these cases, you are primed and ready to move directly to using a graph database within your application.
From our collaborations, we have found a common set of use cases in which databases are needed to manage graph data. Those use cases are the topics of the upcoming chapters, and we will save them for later discussion.
Most often, however, teams know that their problems require graph-shaped data, but they do not know exactly how to answer their questions or which relationships are important. This points toward needing to analyze your graph data.
From here, we challenge you and your team to take one more step in this journey. Our request is that you think about the deliverables from analyzing your graph data. Creating structure and purpose around graph analysis helps your team make more informed choices for your infrastructure and tools. This is the final question posed in Figure 1-4.
Topics in graph data analysis can range from understanding specific distributions across the relationships to running algorithms across the entire structure. This is the area for algorithms such as connected components, clique detection, triangle counting, calculating a graph's degree distribution, PageRank, reasoners, collaborative filtering, and many, many others. We will define many of these terms in upcoming chapters.
We most often see three different end goals for the results of a graph algorithm: reports, research, or retrieval. Let's dig into what we mean by each of those options.
We are going into detail on all three options (reports, research, and retrieval) because this is what most people are doing with graph data today. The remaining technical examples and discussion in this book are focused primarily on when you have decided you need a graph database.
First, let's talk about reporting. Our use of the word reports refers to the traditional need for intelligence and insights into your business's data. This is most commonly referred to as business intelligence (BI). While debatably misapplied, the deliverables of many early graph projects aimed to provide metrics or inputs into an executive's established BI pipeline. The tools and infrastructure you will need for augmenting or creating processes for business intelligence from graph data deserve their own book and deep dive. This book does not focus on the architecture or approaches for BI problems.
Within the realm of data science and machine learning, you find another common use of graph algorithms: general research and development. Businesses invest in research and development to find the value within their graph-shaped data. There are a few books that explore the tools and infrastructure you will need for researching graph-structured data; this book is not one of them.
This brings us to the last path, labeled "retrieval." In Figure 1-4, we are specifically referencing those applications that provide a service to an end user. We are talking about data-driven products that serve your customers. These products come with expectations around latency, availability, personalization, and so on. These applications have different architectural requirements than applications that aim to create metrics for an internal audience. This book will cover these topics and use cases in the coming technology chapters.
Think back to our mention of LinkedIn. If you use LinkedIn, you have likely interacted with one of the best examples we can think of to describe the "retrieval" path in Figure 1-4. There is a feature in LinkedIn that describes how you are connected to any other person in the network. When you look at someone else's professional profile, this feature describes whether that person is a 1st-degree, 2nd-degree, or 3rd-degree connection. The length of the connection between you and anyone else on LinkedIn tells you useful information about your professional network. This LinkedIn feature is an example of a data product that followed the retrieval path of Figure 1-4 to deliver a contextual graph metric to the end users.
The lines between these three paths can be blurry. The difference lies in whether you are building a data-driven product or deriving data insights. Data-driven products deliver unique value to your customers. The next wave of innovation for these products will be to use graph data to deliver more relevant and meaningful experiences. These are the interesting problems and architectures we want to explore throughout this book.
Occasionally you may respond to the questions throughout Figure 1-3 and Figure 1-4 with "I don't know," and that is OK.
Ultimately, you are likely reading this book because your business has data and a complex problem. Such problems are vast and interdependent. At your problem's highest level, navigating the thought process we are presenting throughout Figure 1-3 and Figure 1-4 can seem out of touch with your complex data.
However, drawing on our collective experience helping hundreds of teams around the world, our advice remains that you should break down your problem and cycle through the process again.
Balancing the demands of executive stakeholders, developer skills, and industry demands is extremely difficult. You need to start small. Build a foundation upon known and proven value to get you one step closer to solving your complex problem.
What happens if you ignore making a decision? Too often, we have seen great ideas fail to make the transition from research and development to a production application: the age-old analysis paralysis. The objective of running graph algorithms is to determine how relationships bring value to your data-driven application. You will need to make some difficult decisions about the amount of time and resources you spend in this area.
The path to understanding the strategic importance of your business's data is synonymous with finding where (and whether) graph technology fits into your application. To help you determine the strategic importance of graph data for your business, we have walked through four very important questions about your application development:
Does your problem need graph data?
Do relationships within your data help you understand your problem?
What are you going to do with the relationships in your data?
What do you need to do with the results of a graph algorithm?
Bringing these thought processes together, Figure 1-5 combines all four questions into one chart.
We spent time walking through the entire decision tree for two reasons. First, the decision tree depicts a complete picture of the thought process we use when we build, advise on, and apply graph technologies. Second, the decision tree illustrates where this book's purpose fits into the space of graph thinking.
That is, this book serves as your guide to navigating graph thinking in the paths throughout Figure 1-5 that end in needing a graph database.
When properly leveraged, your business's data can be a strategic asset and an investment that yields a return. Graphs are of particular importance here since network effects are a powerful force that provides exquisite competitive advantage. Additionally, today's design thinking encourages architects to view their business's data as something that needs to get managed with maximal convenience and minimal cost.
This mindset requires a rethinking of how we handle and work with data.
Changing a mindset is a long journey, and any journey begins with one step. Let's take that step together and learn the new set of terms we will be using along the way.
Together over the years, we have advised hundreds of teams on where and how to get started with graph data and graph technologies. From our conversations with those teams, we assembled the most common questions and advice for introducing graph thinking and graph data into your business.
We want to start your journey toward graph thinking with the following three questions that every team will encounter when evaluating graph technologies:
Is graph technology better for my problem than relational technology?
How do I think about my data as a graph?
How do I model a graph's schema?
Those teams that spend the time up-front to understand these three topics are more likely to successfully integrate graph technologies into their stack. Conversely, in our experience, businesses have shelved early-stage graph projects because their teams skipped collectively understanding these questions for their business.
The three questions in the opening section form the outline of this chapter.
We will start off with an abbreviated tour of the differences between relational and graph technologies. Then we will walk through an abbreviated tour of relational data modeling. From the model, we will translate the relational concepts to graph modeling techniques and take a short tour of some fundamental terms from graph theory.
We will also introduce the Graph Schema Language (GSL), a language (or tool) that helps you translate a visual graph schema into code. We created the GSL to help you answer questions 2 and 3 from the beginning of the chapter. Throughout this book, we will use the GSL as a teaching tool to translate a diagram into schema statements.
Inevitably, you are going to have to make some tough decisions about whether, where, and how to introduce graph thinking and technology into your workflow. In this chapter, we are going to introduce tools and techniques to help you navigate a large pool of technical opinions. The foundations we provide here will help you evaluate whether graph technology is the right choice for your next application.
The concepts and technology decisions introduced in this chapter will serve as the foundational material for our future examples. We are using this chapter to clearly illustrate the vocabulary that we will use to describe graph database schema and graph data in the examples throughout this book.
The introduction of graph data into your application brings a new paradigm of thinking about what is important within your data. Understanding the differences in these principles starts with evolving your mindset from relational to graph thinking.
So far we have mentioned two different technologies: relational and graph. When we talk about relational systems, we are referring to organizing your data in a way that focuses on the storage and retrieval of real-world entities such as people, places, and things. When we talk about graph systems, we are referring to systems that focus on the storage and retrieval of relationships. These relationships represent the connections between real-world entities: people know people, people live in places, people own things, and so on.
Both systems can represent entities and relationships alike, but each is built and optimized for one over the other.
The line between selecting a relational system or selecting a graph system for your application is gray; each choice has benefits and drawbacks. Choosing between a relational database and a graph database typically generates a conversation about storage requirements, scalability, query speed, ease of use, and maintainability. While any aspect of such a conversation is worth discussing, we aim to shed light on the more subjective criteria: ease of use and maintainability.
Even though the words relational and relationships are very similar, we use them explicitly to refer to two different types of technologies. The word relational describes a type of database, like Oracle, MySQL, PostgreSQL, or IBM Db2. These systems were created to apply a specific field of mathematics to data organization and reasoning, namely relational algebra. On the other hand, we use the word relationship solely in reference to graph data and graph technologies. These systems were created to apply a different field of mathematics to data organization and reasoning, namely graph theory.
Choosing between relational and graph technologies can be difficult because you cannot compare them at a feature-functionality level. Their differences can be traced to their cores as a result of their being built on distinct mathematical theories: relational algebra for relational systems and graph theory for graph systems. That means the suitability of each technology depends to a large degree on the applicability of those theories and their associated lines of thinking to your problem.
We are going to drill a little further into the differences between relational and graph technologies in the following sections, for two reasons. First, since most people are familiar with relational thinking, we can introduce graph thinking in contrast to relational thinking. Second, we want to provide a response to the inevitable question, "Why not just use an RDBMS?" Both of these reasons are important to explore in the context of understanding graph technology because relational systems are very mature and widely adopted.
Throughout this book, we will use data to illustrate concepts, examples, and new terminology. Let's start with the data that we will be using in this chapter to illustrate the differences between relational and graph concepts. You will see this data in the example that spans Chapters 3, 4, and 5.
We will use the data in Table 2-1 to construct relational and graph data models.
For our first use case, the data describes several customers' assets in the financial services industry. The customers can share accounts and loans, but a credit card can be used by only one customer.
Let's look at a few rows of the data. Table 2-1 displays data about five customers. These five customers and their data will be used to build data models and illustrate new concepts throughout this chapter and the next three chapters.
| customer_id | name | acct_id | loan_id | cc_num |
|---|---|---|---|---|
| customer_0 | Michael | acct_14 | loan_32 | cc_17 |
| customer_1 | Maria | acct_14 | none | none |
| customer_2 | Rashika | acct_5 | none | cc_32 |
| customer_3 | Jamie | acct_0 | loan_18 | none |
| customer_4 | Aaliyah | acct_0 | [loan_18, loan_80] | none |
There are five unique customers in the five rows of sample data shown in Table 2-1. Some of these customers share accounts or loans to illustrate different types of users we typically see in a financial services system.
For example, customer_0 and customer_1, or Michael and Maria, represent a typical parent-child relationship; Michael is the parent, and Maria is the child. The data about customer_2, Rashika, indicates they are a sole user of this financial service. We usually see the highest volume of this type of user in large applications; customers like Rashika only have data that is unique to the customer and is not shared by anyone else. Last, customer_3 and customer_4 (Jamie and Aaliyah) share an account and a loan. This type of data typically indicates that the users are partners who have joined their financial accounts.
If this were your company's sample data, imagine the conversation you might have with your coworker about modeling this data. In this scenario, you are sharing a whiteboard, or other illustrative tool, and you are trying to map out the entities, attributes, and relationships within the data. Whether you use a relational or a graph system, you would likely be having a discussion similar to the conceptual model in Figure 2-1.
From Table 2-1, we find four main entities: customers, accounts, loans, and credit cards. These entities each have relationships tied to the customer. Customers can have multiple accounts, and those accounts can have more than one customer. Customers can also have multiple loans, and those loans can have more than one customer. Finally, customers can have multiple credit cards, but each credit card is unique to one customer.
Your transition from relational to graph thinking starts with data modeling. Understanding data modeling in these two systems begins to illustrate why graph technologies can be a better fit.
If you have been a database practitioner, you've probably been introduced to visual ways of modeling data in a relational system. The most popular choices for creating relational data models are to use the Unified Modeling Language (UML) or to use entity-relationship diagrams (ERDs).
In this section, we will use the example data from Table 2-1 to complete an abbreviated walk-through of relational data modeling with an ERD. We have included just enough information in this section to provide a first step from relational to graph thinking. This is not intended to be a full introduction to the world of relational data modeling. We recommend the seasoned book by C. Batini et al.1 for complete details on relational data modeling. And for those of you who are very comfortable with third normal form, you can skip this next bit and head directly to "The Graph Schema Language".
Generally speaking, data modeling techniques help you describe the real world by describing the entities and their attributes within your data. Each of those concepts has a specific meaning:
An entity is an object such as a person, place, or thing that you need to track in your database.
An attribute refers to a property of an entity such as names, dates, or other descriptive features.
The traditional approach for relational data modeling starts with identifying the entities (people, places, and things) in your data and the attributes (names, identifiers, and descriptions) of those entities. Entities could be customers, bank accounts, or products. Attributes are concepts such as a person's name or bank account number.
For this exercise in data modeling, let's start by modeling two entities from Table 2-1: customers and bank accounts. In a relational system, we traditionally view the entities as tables. This is illustrated in Figure 2-2.
There are two main concepts shown in Figure 2-2: entities and their respective attributes. There are two entities in this diagram: customers and accounts. For each entity, there is a list of attributes that describe the entity. A customer can be described by a unique identifier, name, birthdate, and so on. There are also descriptive attributes for accounts: a unique account identifier and the date the account was created.
In a relational database, each entity becomes a table. The rows of the table contain sample data about that entity, and each column contains values for the descriptive attributes.
In the real world, customers own accounts. The next step in designing a relational database would be to conceptually model this connection. We need to add to our model a way to describe how a person owns a bank account. A popular method for modeling the link from customers to accounts is shown in Figure 2-3.
One visual element that we added between Figure 2-2 and Figure 2-3 is the diamond that connects the person and account entity tables. This connection indicates that there is a link between customers and accounts in the database. Namely, customers own accounts.
The image's other visual details include the double lines between the person table and the owns connection. Here we see an n, with an m on the opposite side of the owns connection. This notation indicates that this is a many-to-many connection between customers and accounts. Specifically, this translates to the idea that one person can own many accounts and that one account can be owned by many customers.
The following nuance about the implementation details is important: links that are shown in ERDs translate to tables or foreign keys. That is, the connections between customers and their accounts are stored as a table within a relational system. This means that the owns table essentially translates to another entity in the database.
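To see that nuance in running code, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are our own invention for illustration, not a schema from this book's examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The two entity tables from the ERD.
cur.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE accounts (acct_id TEXT PRIMARY KEY)")

# The many-to-many "owns" link becomes a table of its own.
cur.execute("""CREATE TABLE owns (
    customer_id TEXT REFERENCES customers(customer_id),
    acct_id     TEXT REFERENCES accounts(acct_id))""")

cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [("customer_0", "Michael"), ("customer_1", "Maria")])
cur.execute("INSERT INTO accounts VALUES ('acct_14')")
cur.executemany("INSERT INTO owns VALUES (?, ?)",
                [("customer_0", "acct_14"), ("customer_1", "acct_14")])

# Even the simple question "who shares acct_14?" must be answered by
# joining through the owns link table.
cur.execute("""SELECT c.name FROM customers c
               JOIN owns o ON c.customer_id = o.customer_id
               WHERE o.acct_id = 'acct_14'""")
print(cur.fetchall())  # -> [('Michael',), ('Maria',)]
```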
Using tables to represent the connections within your data as entities makes it more difficult to understand the links within your data. The mental leap from natural understanding to tabular retrieval is a significant mental hurdle to overcome. This is especially true when you need to understand the connectedness of your data.
Even though we have been forced to think this way for decades, there are better ways.
Let's revisit the data from Table 2-1. However, this time we are going to use the data to illustrate concepts in graph data, followed by how we will model the data in a graph database.
We will use this section to introduce useful terminology from the graph theory community. These terms are used to describe the connectivity of the graph data. Let's visualize the graph data about the first three people from our sample data.
The data visualized in Figure 2-4 will be used to illustrate the fundamental concepts in the rest of this section. This data contains information about three people: Michael, Maria, and Rashika. Michael and Maria share an account, as seen in Figure 2-4. Rashika does not share any data with the other two customers in our example.
The first concepts we need to introduce are the fundamental elements of graphs and graph data and their definitions. These terms are used across all members of the graph community and are accepted as the fundamental elements of a graph.
A graph is a representation of data with two distinct elements: vertices and edges.
A vertex (pl. vertices) represents a concept or entity in data.
An edge represents a relationship or link from one vertex to another.
You have already seen the fundamental elements we are talking about. Our financial data from Figure 2-4 contains four conceptual entities: customers, accounts, credit cards, and loans. These entities naturally translate into the vertices of our graph.
We avoid the term node in this book because we are focusing on distributed graphs, and node has different meanings in distributed systems, graph theory, and computer science.
Next, we use edges to connect our vertices. These connections illustrate the relationships that exist between the pieces of data. In graph data, an edge connects two vertices as an abstract representation of a relationship between the two objects.
For this data, we will use edges to show the relationship between a person and their financial data. We model the data to say that the customer owns accounts, the customer owes loans, and the customer uses credit cards. The edges in the graph database become the relationships of owns, owes, and uses.
Together, all of the vertices and edges in the data represent the full graph.
While there are many foundational topics in graph theory to explore, the term to start with is adjacency. You will find this term used throughout graph theory to talk about how data is connected. Essentially, adjacency is the mathematical term used to describe whether vertices are connected to each other. Formally, it is defined as follows:
Two vertices are adjacent if they are connected by an edge.
In Figure 2-4, Maria is adjacent to acct_14. Also, we see that both Michael and Maria are adjacent to acct_14 because they both own that account. The benefit to using graph data in your application is immediately apparent when you can see how different entities are related in a way that you may not have previously seen.
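As a minimal illustration (our own sketch, not a graph database query), the edges from Figure 2-4 can be held as tuples, and adjacency reduces to checking whether two vertices share an edge.

```python
# Edges from the sample data: (source, label, destination).
edges = [
    ("Michael", "owns", "acct_14"),
    ("Michael", "owes", "loan_32"),
    ("Michael", "uses", "cc_17"),
    ("Maria",   "owns", "acct_14"),
    ("Rashika", "owns", "acct_5"),
    ("Rashika", "uses", "cc_32"),
]

def adjacent(a, b):
    """Two vertices are adjacent if any edge connects them."""
    return any({a, b} == {src, dst} for src, _, dst in edges)

print(adjacent("Maria", "acct_14"))    # True
print(adjacent("Michael", "acct_14"))  # True
print(adjacent("Michael", "Maria"))    # False: connected only through acct_14
```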
The idea of adjacency will come up many more times throughout this book, in topics ranging from the connectedness of data to different storage formats on disk. For now, it is important to know only that this popular term refers to how vertices are connected.
Data that is connected forms communities. In graph theory, these communities are called neighborhoods.
For a vertex v, all vertices that are adjacent to v are said to be within the neighborhood of v, written N(v). All vertices within a neighborhood are neighbors of v.
Figure 2-5 shows the concept of graph neighborhoods starting from customer_0, Michael. In this sample data, the vertices cc_17, loan_32, and acct_14 are directly connected, or adjacent, to Michael. We call this the first neighborhood of customer_0.
You can continue this concept by walking further away from the starting vertex. The second neighborhood consists of those vertices that are two edges away from Michael; Maria is in the second neighborhood of Michael. It also works to say the reverse, that Michael is in the second neighborhood of Maria. This can continue on throughout the graph as we walk through the full depth of vertices from a singular starting point.
The concept of neighborhoods brings us to distance. Another way to talk about the connectedness of this sample data is to say how many steps it takes to walk from one vertex to another. Talking about Michael's first or second neighborhood is the same as finding all vertices that are a distance of 1 or 2 from Michael.
In graph data, distance refers to the number of edges that you have to walk through to get from one vertex to another.
In Figure 2-5, we selected the starting point as the vertex Michael. The vertices cc_17, loan_32, and acct_14 are in Michael's first neighborhood, which is the same as a distance of 1 from Michael.
In mathematical communities, you will see this written as dist(Michael, cc_17) = 1. That also means that everything in the second neighborhood from your starting point is two edges away, and so on; specifically, dist(Michael, Maria) = 2.
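A short, self-contained sketch (ours) shows how neighborhoods and distance fall out of a breadth-first walk over the sample edges.

```python
from collections import deque

# Undirected view of the sample edges around Michael (Figure 2-5).
edges = [
    ("Michael", "acct_14"), ("Michael", "loan_32"), ("Michael", "cc_17"),
    ("Maria", "acct_14"),
    ("Rashika", "acct_5"), ("Rashika", "cc_32"),
]

neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

def dist(start, goal):
    """Breadth-first search: the number of edges on the shortest walk."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        vertex, d = queue.popleft()
        if vertex == goal:
            return d
        for n in neighbors.get(vertex, ()):
            if n not in seen:
                seen.add(n)
                queue.append((n, d + 1))
    return None  # no connecting path

print(dist("Michael", "cc_17"))    # 1: the first neighborhood
print(dist("Michael", "Maria"))    # 2: the second neighborhood
print(dist("Michael", "Rashika"))  # None: not connected
```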
The ideas of adjacency, neighborhoods, and distance help us understand if two pieces of data are connected. For many applications, it is especially useful to understand how well a piece of data is connected to its neighbors.
The difference between if and how well data is connected introduces a new term from the math community: degree.
A vertex's degree is the number of edges that are incident to (i.e., touch) the vertex.
In other words, we talk about a vertex's degree in reference to the number of edges that touch that vertex.
Recall Figure 2-4 as we walk through the upcoming examples. In that figure, we see that there are three edges that connect Michael to cc_17, loan_32, and acct_14. We apply this to say that the degree of Michael is three, or deg(Michael) = 3.
In this data, we have two vertices that have a degree of two. Specifically, acct_14 is adjacent to Michael and Maria, so it has a degree of two. On the right side of the image, we see that Rashika also has only two edges. That means that Rashika has a degree of two.
There are a total of five vertices in our example data that have a degree of one. They are loan_32, cc_17, Maria, cc_32, and acct_5.
In graph theory, a vertex with a degree of one is called a leaf.
We also break down a vertex's degree into two subcategories according to whether the edge starts at or ends with that particular vertex. Let's introduce two new terms that describe these categories.
A vertex's in-degree is the total number of incoming edges that are incident to (or touch) the vertex.
A vertex's out-degree is the total number of outgoing edges that are incident to (or touch) the vertex.
Let's apply these definitions to the examples we just walked through.
All three of Michael's edges start at Michael and end at other vertices: cc_17, loan_32, and acct_14. Therefore, we say that the out-degree of Michael is three because all three of the edges are outgoing.
The in-degree of acct_14 is two because it has incoming edges from Michael and Maria. Both of Rashika's edges are outgoing edges, so we say that Rashika has an out-degree of two.
The in-degree of cc_17 is one because the edge is incoming; the same is true for loan_32, cc_32, and acct_5. Maria has an out-degree of one because its one edge is an outgoing edge to acct_14.
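These counts are easy to reproduce. The following sketch (ours) derives degree, in-degree, and out-degree, plus the leaves, from the directed edges in Figure 2-4.

```python
from collections import Counter

# Directed edges from Figure 2-4: (source, destination).
edges = [
    ("Michael", "acct_14"), ("Michael", "loan_32"), ("Michael", "cc_17"),
    ("Maria", "acct_14"),
    ("Rashika", "acct_5"), ("Rashika", "cc_32"),
]

out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

def degree(v):
    """Total number of edges incident to v, regardless of direction."""
    return in_degree[v] + out_degree[v]

print(out_degree["Michael"])  # 3
print(in_degree["acct_14"])   # 2
print(degree("Rashika"))      # 2

vertices = set(in_degree) | set(out_degree)
print(sorted(v for v in vertices if degree(v) == 1))
# -> ['Maria', 'acct_5', 'cc_17', 'cc_32', 'loan_32']: the five leaves
```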
Data scientists and graph theorists use a vertex's degree to understand the type of connections found within the graph data. One of the places to start is to find the most highly connected vertices within your graph.
Depending on the application, vertices of very high degree can be thought of as hubs or highly influential entities.
It is useful to find these highly connected vertices because there are performance ramifications when they are stored or queried in a graph database. To a graph database practitioner, vertices of extremely high degree (>100,000 edges) are known as supernodes.
For the purposes of this section, we want to illustrate how to apply and interpret graph structure in your application. We will get into the performance details of highly connected vertices in Chapter 9, where we formally define supernodes, reason about their influence in your database, and step through mitigation strategies.
Recall that we opened this chapter with three questions. Everything up until now addressed the first two of those questions. The final section of this chapter introduces a tool to teach you how to model graph schema by translating visual diagrams into code. Let's dive in.
Graph practitioners, academics, and engineers have generally agreed on terms and methods for illustrating graph data. However, the terms are used confusingly across the technical and academic communities. There are words that mean one thing to a graph database practitioner and something different to a graph data scientist.
To address confusion across the communities, in this book we are introducing and formalizing terminology to describe graph schema. This language is called the Graph Schema Language, or GSL. The GSL is a visual language for applying concepts to create graph database schema.
We created the GSL as a teaching tool to use throughout the examples in this book. Our purpose in creating, introducing, and using the GSL is to normalize how graph practitioners communicate conceptual graph models, graph schema, and graph database design. To us, this set of terminology and visual illustrations complements the graph languages popularized by the academic community and the standardization initiatives within the graph community.
We will use the visual cues and terminology described in this section throughout the conceptual graph models used in this book. We hope the many upcoming examples are good practice for translating a visual schematic into schema code.
The fundamental elements of graph data, vertices and edges, give us the first terms of the Graph Schema Language (GSL): vertex labels and edge labels. Where relational models use tables to describe the data in Figure 2-3, we use vertex labels and edge labels to describe a graph's schema.
A vertex label is a set of objects that are semantically homogeneous. That is, a vertex label represents a class of objects that share the same relationships and attributes.
An edge label names the type of relationship between vertex labels in your database schema.
In graph modeling, we label each entity with a vertex label and describe the relationship between entities with an edge label.
Generally speaking, vertex labels describe entities in your data that share attributes of the same type and relationships of the same label. Edge labels describe relationships between vertex labels.
The terms vertices and edges are used in reference to data. To describe a database's schema, we use the terms vertex labels and edge labels.
For the data from Table 2-1, we would model the same customer and account with the conceptual graph model shown in Figure 2-6. This conceptual graph model looks very similar to the ERD from Figure 2-3, but with the translation to using the first two terms of the GSL.
In the GSL, a vertex label is illustrated with a circle containing the label's name. Figure 2-6 shows this for the Customer and Account vertex labels. An edge label is illustrated with a named line between two vertex labels. We see this in Figure 2-6 with the owns edge label between the Customer and Account vertex labels. When we look at this illustration, we infer that a customer has a relationship to an account; specifically, the customer owns the account. We will get into an edge's direction later, in "Edge Direction".
Where relational models use attributes to describe the data in Figure 2-3, properties describe data in graph modeling. That is, where we had attributes before, we have properties in a graph model.
A property describes features of a vertex label or an edge label, such as names, dates, or other descriptive features.
In Figure 2-7, each vertex label has a list of properties associated with it. These properties are the same attributes from the relational ERD in Figure 2-3. A customer vertex is described by its unique identifier, name, and birthdate. An account is described with its account ID and created date, as before. We add an edge label of owns to describe the relationship between the two entities in this data model.
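Although the GSL is a visual language, its vocabulary maps cleanly onto data structures. The sketch below is a hypothetical, database-agnostic encoding of the model in Figure 2-7 in Python; the label and property names follow the figure, but the structure is our own illustration, not any database's schema API.

```python
# Vertex labels, each with the list of properties that describes it.
vertex_labels = {
    "Customer": ["customer_id", "name", "birthdate"],
    "Account":  ["acct_id", "created_date"],
}

# Edge labels, each as (vertex label it starts from, vertex label it points to).
edge_labels = {
    "owns": ("Customer", "Account"),
}
```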
Note that the term property applies to concepts in graph schema and graph data.
The next modeling concept defined in the GSL is an edge's direction. When we set up our edge labels in our data model, we connected the vertex labels together in a way that follows how we would naturally talk about the data; we would say that a customer owns an account, and we modeled the data in our graph that way. This also gives each edge a direction.
There are two ways in which you can model the direction of an edge: directed and bidirectional.
A directed edge goes one way: from one vertex label to the other vertex label.
A bidirectional or bidirected edge goes in both directions between the vertex labels.
Using the GSL, we indicate the direction of an edge with an arrow at either one end or both ends of the edge line. This is illustrated in Figure 2-8 with the arrow below the edge label.
We would say that our example in Figure 2-8 shows a directed edge label. We have an edge label that goes from the customer to an account. This edge label uses a directed edge to model that a customer owns an account.
On the other hand, it might be useful to model our data in the opposite direction. One way to do this is by adding a second directed edge from the account to the customer. This edge indicates that the account is owned by the customer, like in Figure 2-9. We say that all of the edges in Figure 2-9 are directed because they go only one way.
An edge label's direction comes from how we communicate about our data. When you describe your data, you use a subject, a predicate (i.e., a verb), and an object to communicate about your domain. To see this, consider how you would describe the sample data we have been using so far in this chapter. You likely are thinking something like "customers own accounts" or "bank accounts are owned by customers." In the first phrase, the subject is "customers," the predicate is "owns," and the object is "accounts." This gives us a source vertex label, Customer, and a destination vertex label, Account; the predicate "owns" translates to an edge label and has a direction from the customer to their account. We can follow a similar process for the second phrase to derive an edge label of "owned_by" from Account to Customer.
Loosely speaking, the identification of the subject, predicate, and object of your description creates an edge label's direction. The subject is the first vertex label and is where an edge starts. In the GSL, we call this the domain. Then the predicate becomes your edge label. Last, the object is the destination or range of an edge label. This means the edge label comes from the domain and goes to the range. This gives us two new terms:
The domain of an edge label is the vertex label from which the edge label originates or starts.
The range of an edge label is the vertex label to which the edge label points or ends.
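A tiny sketch (ours) makes the mapping mechanical: identify the subject, predicate, and object of a phrase, and you have the domain, edge label, and range.

```python
# Subject-predicate-object phrasing carries over directly:
# subject -> domain, predicate -> edge label, object -> range.
phrases = [
    ("Customer", "owns", "Account"),
    ("Account", "owned_by", "Customer"),
]

for domain, label, range_ in phrases:
    print(f"edge label '{label}': domain={domain}, range={range_}")
```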
The last concept to illustrate in this section is a bidirectional edge. For the data we have talked about so far, it doesn't exactly make semantic sense to have an edge label that goes in both directions. That is, it is meaningful to say that a customer owns an account, but it doesn't make sense to say "an account owns a customer." We have to change the edge label to say "an account is owned by a customer."
To best illustrate an edge label that is bidirectional, let's add relationships between customers into our example. Specifically, let's add an edge label that connects customers who are family members. This is a better example to illustrate a bidirectional edge label and is shown in Figure 2-10.
In this model, we are indicating that customers can be family members of other customers. We interpret this type of relationship as a reciprocal relationship: if you are a family member to someone else, they are also your family member. We model this in the GSL using a line with an arrow on both ends and say that this edge label is bidirectional or bidirected.
In graph theory, a bidirected edge is the same thing as an undirected edge. That is, modeling a relationship in both directions is essentially the same as not having any specific direction. However, in the context of this book, we are using relationships between data to provide meaning to an application and therefore have to consider an edge's direction.
When first encountering direction, it can be a tricky concept to wrap your head around. In graph development, one of the best ways to think about direction comes from how you speak about your data. We recommend creating a description of your data and identifying how you would explain the relationships within it. This helps mentally translate your conceptual understanding of your data's relationships into an edge label's direction.
Without explicitly calling it out, we introduced a new concept in Figure 2-10 that we would like to define now. If an edge starts and ends on the same vertex label, we say that it is a self-referencing edge label. In the GSL, we would draw and notate this as seen in Figure 2-11.
A self-referencing edge label is where the edge label's domain and range are the same vertex label.
Figure 2-11 is the correct way to illustrate an edge label that starts and ends on the same vertex label. We say that this is a self-referencing edge label. In the case of Figure 2-11, it is also a bidirected edge label. However, not all self-referencing edge labels are bidirected.
You will see an example of a directed, self-referencing edge label in an upcoming chapter. This is the case when you need to model a recursive relationship; specifically, when something is contained within something else, or when you have a parent-child relationship.
When you start diving into data modeling with graphs, you will probably want a way to indicate how many relationships can exist between different vertex labels in your graph.
We have some good news on this topic. There is only one option for describing the number of relationships in most graph models: many.
In DataStax Graph and in most other graph databases, all edge labels represent many-to-many relationships. Meaning, any vertex can have many other vertices connected to it by a particular edge label. In an ERD, this is called many-to-many; in UML you use 0..* to 0..*. Sometimes, this is also referred to as an m:n relationship within the relational community.
We use the term multiplicity to describe this concept:
Multiplicity is a specification of the range of allowable sizes that a group may assume. Namely, multiplicity describes the range of allowable sizes that the group of vertices adjacent to a given vertex along a particular edge label may assume.2
The actual size of the set or collection is referred to as cardinality. Cardinality is defined as the finite number of elements in a particular set or collection.
For correctness and clarity, we exclusively use the term multiplicity when talking about modeling edge labels within a graph schema. Let's dig a little deeper into the two options available when you apply the definition of multiplicity in your graph model.
The application of multiplicity to your graph’s schema comes down to understanding the different kinds of groups of adjacent vertices that are possible. There are only two: a set or a collection.
A set is an abstract data type that stores unique values.
A collection is an abstract data type that stores nonunique values.
In a set of adjacent vertices, there can be only one instance of a vertex. In a collection of adjacent vertices, there could be many instances of a vertex. We illustrate the difference between these concepts in Figure 2-12.
The graph on the left in Figure 2-12 shows that the group of vertices adjacent to Michael is a set: {acct_0}. This means that we want to have at most one edge between a customer and an account in our database. The graph on the right in Figure 2-12 shows that the group of vertices adjacent to Michael is a collection: [acct_0, acct_0]. This means that we want to have many edges between a customer and an account in our database. An example of when you would like many edges is when you want to represent that a customer is both administrator and user of an account.
The most likely occasion for needing to decide about the multiplicity type is when you are modeling time on your edges. Do you want only the most recent edge in your database? Then you are thinking of your edge as a set. Do you want all of your edges over time? Then you are thinking of your edges as a collection. We will cover time on edges in Chapters 7, 9, and 12.
Let’s look at how we would model the differences between the two graphs from Figure 2-12 in the GSL.
Figure 2-13 shows how we use a single line in the GSL to illustrate that we want to have at most one edge between two vertices.
In order to be able to model many edges between two vertices, we need a way for one edge to be different from another edge. In Figure 2-12, we added the role property to the edge so that each edge is different. Figure 2-14 shows how we use a double line and a property value in the GSL to illustrate that we want to have many edges between two vertices.
The trick to understanding multiplicity lies in understanding your data. If you need to have multiple edges between two vertices—because you need the group of connected vertices to be a collection rather than a set—then you need to define a property on the edge that makes it unique.
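As a sketch of what this looks like in practice, using the Gremlin insertion steps that appear later in this book (the michael and acct_0 variables are assumed to hold previously created vertices), two owns edges between the same pair of vertices become distinct once each carries its own role value:

// Hypothetical sketch: two edges between the same customer and account.
// The differing role values make each edge unique, so the group of
// adjacent vertices becomes a collection: [acct_0, acct_0].
g.addE("owns").from(michael).to(acct_0).property("role", "administrator").next();
g.addE("owns").from(michael).to(acct_0).property("role", "user").next();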
Using the GSL, the data from Table 2-1 translates into the conceptual graph model shown in Figure 2-15.
We refer to the image in Figure 2-15 as a conceptual graph model. This model creates your graph database schema. This conceptual graph model shows a customer and three different pieces of data related to the customer. These four entities translate to four separate vertex labels: Customer, Account, Loan, and CreditCard.
These four pieces of data are related in three ways: customers own accounts, use credit cards, and owe loans. This creates three edge labels in the conceptual graph model: owns, uses, and owes, respectively. All three edge labels are directed; there are no bidirected edge labels in this example. Further, we see that the edge labels uses and owes will have at most one edge between two vertices, whereas owns can have many edges.
The final piece of Figure 2-15 to explore is the properties shown on each vertex label. These are the properties that we can find in the data from Table 2-1. A Customer has two properties: customer_id and name. The Account, CreditCard, and Loan vertex labels each have only one property: acct_id, cc_num, and loan_id, respectively. In each case, the property is the unique identifier for the data.
It is important to understand the difference between Figure 2-15 and the instance data we showed in Figure 2-4. Figure 2-15 shows the conceptual graph model for your database schema using GSL. Figure 2-4 shows what the data will look like in your graph database.
The most difficult evaluations of relational versus graph technologies are those that intertwine techniques for database modeling with those of data analysis. We want to conclude with a few notes on each of those topics to set you up for more effective evaluation processes.
Graph data modeling is very similar to relational data modeling; the main difference is in the consideration of relationships between entities. Graph technology is optimized for relationship-first data organization so as to provide direct access to an entity’s relationships in the database. Given this, you will want to explore graph technology if the relationships between your entities are the most important features of your data.
In contrast to relational technology, graph technology was created to minimize the transition from mental model to data storage and retrieval. With graph technologies, the conceptual data model is the actual physical data model. That is, you don’t have to specifically do any physical data modeling, as the graph database optimizes the storage and physical layout based on the logical model alone. This is achieved by storing the edges for a vertex in structures that give direct access to the edges associated to a vertex.
In our experience, the shorter translation from mental model to data storage is one of the primary reasons architects are turning from relational to graph technologies. When using graph technology, you can draw one image that represents both the conceptual understanding and the physical organization of your data. This shorter interpretation from conceptual to physical data organization creates a more powerful way to envision, discuss, and apply the relationships within your data. Without graph thinking and technology, this was previously unachievable.
Applied graph theory empowers the appeal of using graph technology in your application. Graph technology gives you the means to understand both if and how well your data is connected. Specifically, concepts such as neighborhoods and degree open up a new understanding of your data that is not possible with relational technologies.
The nuances between the worlds of graph schema and graph data are very important. Introducing graph technology to your team comes with a learning curve of new terms, concepts, and applications. One of the most effective ways to keep yourselves from being blocked is to understand which concepts apply to database modeling and which apply to data analysis at the application level.
We have learned from our experiences that teams often confuse concepts about graph data analysis and graph schema. As we see it, interchanging terms from graph schema and graph data is the same as confusing the following two concepts: pie charts and foreign key constraints.
Let’s unpack that.
Relational technologies are great for setting up a database that creates reports and summaries of data, like pie charts. Something like a pie chart visualizes a metric about the data. The application (the pie chart) is a completely different, and unrelated, concept from relational schema design, like selecting foreign key constraints between tables.
An application of your relational database is to create pie charts, and the database’s schema requires designing with foreign keys to make it possible.
When getting started with graph technology, this same distinction applies. After you have set up your graph database, you use it to understand the connectivity of your data. Specifically, you can find the distance between two vertices in your graph. This is at the application level and uses data to understand the connections within the data. This is made possible by creating a graph database schema with vertex labels, edge labels, and properties.
An application of your graph database is to calculate the distance between vertices, and the database’s schema requires designing with edge labels and vertex labels to make that possible.
The important takeaway here is to understand the differences between creating database schema and analyzing graph data.
Up until now, the flood of graph thinking has introduced many waves of terminology and complexity. In this chapter, we hope to have clearly delineated the techniques for creating database models as well as a few for analyzing graph data.
This chapter set out to translate the concepts and terms that are used across multiple communities. Our goals for this content were to provide background and information for the following three objectives:
Is graph technology better for my problem than relational technology?
How do I think about my data as a graph?
How do I model a graph’s schema?
In our experience, these three questions are the primary topics of conversation within development teams that are evaluating graph technology for their application stack.
We selected the content in this chapter as the minimum topic set needed to answer these questions. The terms and topics in this chapter represent the starting point for understanding data modeling, graph data, and application design in relational or graph systems. Combined with the GSL, the foundational concepts throughout this chapter represent what you need to know to get started with graph technology. At this point, you are equipped with the terminology and concepts you need to begin your first application design and evaluation.
We admittedly haven’t given you much that answers our first question. This is because we can’t answer it directly for you. Your team’s need for graph technology for your application comes down to the applicability of the concepts and terms presented throughout this chapter. To boil it down: if relationships matter to your data, then graph technology will be the right answer for your team. Only you can determine that about your data.
On the other hand, we can help you navigate the use of relational or graph technology for specific use cases. The next chapter will walk you through a common starting use case in which teams typically put relational versus graph technologies to the test. Without further ado, let’s start with the foundation that companies have successfully built as the gateway to using graph data in your business: the single view of your customer.
1 Carlo Batini, Stefano Ceri, and Shamkant B. Navathe, Conceptual Database Design: An Entity-Relationship Approach, vol. 116 (Redwood City, CA: Benjamin/Cummings, 1992).
2 James Rumbaugh, Ivar Jacobson, and Grady Booch, The Unified Modeling Language Reference Manual, vol. 2 (Reading, MA: Addison-Wesley, 1999).
We have often seen technical teams understand the benefits of graph thinking in the context of discussing a data problem that most large businesses face: trying to extract value across disparate data sources. Standing at a whiteboard sketching out the problem inevitably produces one hairy graph.
You can imagine this same scenario. You are drawing at a whiteboard and actively discussing how your system’s data is spread in different silos across the company’s systems. Your team agrees that what it really needs is direct access to your customers and their data. To illustrate this, almost every time, your coworker draws the customer at the center of the whiteboard and connects the relevant data to the customer. After stepping back, you all realize your colleague just drew a graph.
In our experience, these whiteboarding exercises illustrate the power of using graph thinking to build a data management solution. Graph applications start with data management because, either conceptually or physically, previous technology choices forced us to shape graph data into tabular solutions. The problem is that tabular-shaped data is no longer a one-size-fits-all design for today’s applications.
This is especially true for those applications that have to cater to the user’s demand for personalized context. The rising demand for personalization has put top-down pressure on data availability and relevancy. This pressure has forced organizations to integrate disparate data and ensure that data ties users to their digital experience.
When teams huddle around the drawing board to re-architect their systems to deliver personalization, they encounter a new problem. How does a single system unify data, function in real time, and relate the data back to the end user? Existing relational tools are great for those parts of the process that require the data to fit well in a row-and-column format.
However, relational tools are not well suited for delivering certain shapes of data—specifically, deeply connected data.
At this point in the whiteboarding session, we have reached a significant discussion topic: identifying and comparing solutions. The solution design process often introduces multiple technologies. The subsequent debate around which technology to choose can be divisive and never-ending.
To address this common pressure point, the main goals of this chapter are as follows:
Define and formalize a common starting application for graph data
Build out an example application architecture with relational and graph technologies
Provide a guide for making the right choice for your system’s needs
Throughout the rest of this chapter, we are going to introduce and motivate the use case that we just described in the whiteboard story. Afterwards, we are going to step through the implementation details of this type of application, starting with a relational system. Then we will follow the same process with a graph system. We will close this chapter with a discussion of how to select which technology is best for your application. Getting this right will help you find roots in the seemingly circular and never-ending debate over when, where, and how to apply graph technology to resolve your data management needs.
As we illustrated with the whiteboard story, tech teams all over the world are realizing the utility of graph data to solve their data management problems. For this type of problem, the difference between old and new solutions lies in the usefulness of modeling, storing, and retrieving relationships within your data.
Applications that aim to focus on the relationships in their data have the initial challenge of transforming and unifying data across relational systems. This transformation requires us to reorganize our thinking and processes from organizing entities to organizing relationships. Similar to the whiteboard drawing from earlier, the new approach to organizing your data according to its relationships is usually very close to what we have in Figure 3-1.
Adopters of graph technologies independently converged on naming this type of solution a Customer 360 application, commonly abbreviated C360. The vision of a C360 project, like what is illustrated in Figure 3-1, is to engineer an application around the relationships between the important entities in your business.
You can envision the goal of a C360 application; there is a central object, your customer, and the customer’s relationships to other integral pieces of data. These pieces of data are likely those that are most relevant to your business’s domain. Commonly, we see teams start with the customer’s family, methods of payment, or important identification details. This particular application within financial services is designed to answer the following types of questions about your customer:
Which credit cards does this customer use?
Which accounts does this customer own?
Which loans does this customer owe?
What do we know about this customer?
The idea to unify consumer data into a single application is not new. Existing solutions such as data warehouses or data lakes provide single systems in which consumer data is stored. The problem here isn’t in the integration of a business’s data but in its accessibility. The era of graph thinking made us revisit these solutions in search of a way to make this data more available and representative of the individual’s experience.
Think of it this way: would you rather spend the day fishing, or would you like to get quick access to your dinner?
The difference between fishing for or ordering your dinner is similar to putting your data in a data lake or organizing your data for quick retrieval. The demands of today’s digital applications require architects to focus on the quick delivery of data. Graph technologies allow architects to build deeply connected retrieval systems to complement longer search expeditions across data lakes.
Consumers interact with your company in an omnichannel fashion; they seamlessly transition from your mobile or web applications to your social media feeds and physical storefronts. Across all of these channels, they are experiencing an integrated perception of your brand. Companies that match this expectation by creating a unified digital experience are seeing revenue increases of up to 10%. These revenue increases are measured to be two to three times faster than those of companies that have not unified their customers’ digital experiences.1
The secret ingredient behind this observed revenue growth is an application that unifies all customer data. Bringing together all of your customers’ data into an application mirrors each customer’s experience with your brand. In other words, it is a C360 application.
There are myriad creative examples of early innovators who have deployed interesting C360 applications. One of the more unique examples comes from Baidu (the Google of China) and Kentucky Fried Chicken. Through a unified data platform, Baidu teamed up with KFC to deliver order recommendations. Their collaborative solution identifies customers, accesses their order history, and returns order recommendations. This integration of data across these two industries has proven to be a unique and profitable example of C360 technologies.
A C360 application is the starting place for implementing graph thinking in your business. Getting this right provides a solid foundation for introducing graph data into your system’s architecture. We have found that one of the most common mistakes made by architects and system designers is to move too quickly from the conceptual model to implementation details with graph technologies. There is more to consider here, and we want to use our experience to guide you through your own evaluation throughout the rest of this chapter.
The goal of this section is to briefly introduce how to build out a relational system to store the C360 data. This section does not serve as a complete introduction to this class of system architecture. Rather, our goal is to introduce the minimum needed to understand the complexities of using a relational system for a C360 application.
To illustrate the process from data modeling through queries, we will be using the same data introduced in Table 2-1. For convenience, we are sharing the table of data again in this chapter—see Table 3-1. For a complete refresher on the generation, meaning, and details of this data, refer to the discussion in “Data for Our Running Example”.
| customer_id | name | acct_id | loan_id | cc_num |
|---|---|---|---|---|
| customer_0 | Michael | acct_14 | loan_32 | cc_17 |
| customer_1 | Maria | acct_14 | none | none |
| customer_2 | Rashika | acct_5 | none | cc_32 |
| customer_3 | Jamie | acct_0 | loan_18 | none |
| customer_4 | Aaliyah | acct_0 | [loan_18, loan_80] | none |
The two technologies we will be using to illustrate a relational implementation are SQL and Postgres. SQL, which stands for Structured Query Language, is the programming language used to communicate with a relational database. We have chosen to use the Postgres RDBMS because of its wide applicability and origins within the open source community.
After agreeing on a conceptual model, like that shown in Figure 2-1, you can move on to the design of your relational database. Traditionally, you would create an entity-relationship diagram, or ERD. An ERD is a logical representation of your data model and is a typical starting point for a relational database design.
In Figure 3-2, each square represents an entity that will become a table in the relational database. The attributes, or descriptive properties about each entity, are listed within each square. As already seen in the data, each entity will have a unique identifier. A customer will be uniquely identified by its customer_id, an account by its acct_id, and so on. Customers also have names and, in larger applications, other attributes.
The diamond shapes between entities in Figure 3-2 represent the connection from one entity to another. The cardinality of the connection is indicated above and below or to the left and right of the diamond shape. In this data, we have two types of connections: one-to-many and many-to-many.
Let’s start with the one-to-many connection that is shown between customers and credit cards. In this example, a customer can have many credit cards, but a credit card can only have one customer. This one-to-many connection describes the cardinality between customers and credit cards and is illustrated with the n to 1 connection between customers and credit cards in Figure 3-2.
The other type of connection we see in our data is a many-to-many connection. There are two many-to-many connections in our data: customers to accounts and customers to loans. We know from our data that a customer can have many accounts, and one account can have many customers. The same is true for loans. We say that customers to loans is a many-to-many connection and illustrate this in Figure 3-2 with the n to m notation on the connection.
Before creating tables and inserting data, we need to translate our logical data model into a physical data model. Specifically, we need to translate the entities and connections from the ERD illustrated in Figure 3-2 into tables with primary and foreign keys.
We need two types of keys for this implementation: primary and foreign keys. A primary key is a uniquely identifying piece of data, such as a customer’s ID or credit card number, that we will use to access the information in its table. A foreign key is a uniquely identifying piece of data that we will use to access the information in a different table, such as storing a customer’s ID alongside their credit card information. We store a customer’s ID with the customer’s credit card information so that we can use it to look up all of their information in a different table, namely in the customer table.
Let’s take a look at how the keys and data from Figure 3-2 map into the physical data model shown in Figure 3-3.
We expected to see at least four tables in Figure 3-3—one table for each entity. Specifically, Figure 3-3 has one table per entity type: customer, account, loan, and credit card. For each of those tables, we see additional attributes that describe the entity. The most important attribute for each of these entities is its primary key. Each primary key is indicated with a PK next to the row. The primary keys we have for each table are the customer_id, acct_id, loan_id, and cc_num, respectively. These are the unique identifiers that we will use to look up a specific row of information in the table.
Before we talk about the other two tables in Figure 3-3, let’s examine the CreditCard table. This table has both a primary key and a foreign key. We are using a foreign key in this table to track the one-to-many relationship we created in our ERD. The customer_id is the foreign key (indicated with an FK) that will give us the ability to relate the credit card information back to a unique customer. Building a one-to-many relationship into your physical data model can be as easy as adding a foreign key to join you back to another entity table.
The last two tables to understand in our physical data model are the Owns and Owes tables. These tables are join tables that allow us to physically store the many-to-many connections in our data. The Owns table stores the many connections observed between customers and the accounts that they own. The Owes table stores the many connections observed between customers and the loans that they owe. Since each customer can own an account only once and can owe a loan only once, the primary key of these join tables is a compound of the two foreign keys.
For example, the Owns table stores at least two pieces of information about each row in the table: the customer’s unique identifier and the account’s unique identifier. Given one row from this table, we can access both the customer’s unique identifier to join back to the customer table and the account’s unique identifier to join back to the account table. This join table is a common way to represent a many-to-many connection in a relational system.
Given our physical data model, let’s walk through creating the tables and inserting our sample data from Table 3-1 into the tables.
First, we want to create the customer table. The final data model for this is shown in Figure 3-4.
The SQL statement for creating the customer table is:
CREATE TABLE Customers (
    customer_id TEXT,
    name TEXT,
    PRIMARY KEY (customer_id)
);
Our data has five customers. Let’s insert those five customers into this customer table:
INSERT INTO Customers (customer_id, name) VALUES
    ('customer_0', 'Michael'),
    ('customer_1', 'Maria'),
    ('customer_2', 'Rashika'),
    ('customer_3', 'Jamie'),
    ('customer_4', 'Aaliyah');
Our data in our relational database now has one table with five entries, as shown in Figure 3-5.
Next, let’s add the other three entity tables for Accounts, Loans, and CreditCards. Their final data models are shown in Figure 3-6.
We will start by creating the two tables for Accounts and Loans:
CREATE TABLE Accounts (
    acct_id TEXT,
    created_date DATE DEFAULT CURRENT_DATE,
    PRIMARY KEY (acct_id)
);

CREATE TABLE Loans (
    loan_id TEXT,
    created_date DATE DEFAULT CURRENT_DATE,
    PRIMARY KEY (loan_id)
);
Next, let’s insert the data for Accounts and Loans:
INSERT INTO Accounts (acct_id) VALUES
    ('acct_0'),
    ('acct_5'),
    ('acct_14');

INSERT INTO Loans (loan_id) VALUES
    ('loan_18'),
    ('loan_32'),
    ('loan_80');
At this point, we have one last entity table to create in our relational database: the table for CreditCards. Because credit cards have a one-to-many relationship with customers, we also need to insert the customer’s ID as a foreign key. We create this table with:
CREATE TABLE CreditCards (
    cc_num TEXT,
    customer_id TEXT NOT NULL,
    created_date DATE DEFAULT CURRENT_DATE,
    PRIMARY KEY (cc_num),
    FOREIGN KEY (customer_id) REFERENCES Customers (customer_id)
);
Looking back at the data from Table 3-1, we find each unique credit card and the customer who owns that card. From this information, we can create the following statements to insert this data into our relational database:
INSERT INTO CreditCards (cc_num, customer_id) VALUES
    ('cc_17', 'customer_0'),
    ('cc_32', 'customer_2');
Our relational database now has a total of four tables with data; see Figure 3-7.
The last two tables to create in our relational implementation are those for the many-to-many connections from Customers to Accounts and Loans. First, let’s create the table that will join Customers to Accounts, as illustrated in Figure 3-8.
In SQL, we create this table with:
CREATE TABLE Owns (
    customer_id TEXT NOT NULL,
    acct_id TEXT NOT NULL,
    created_date DATE DEFAULT CURRENT_DATE,
    PRIMARY KEY (customer_id, acct_id),
    FOREIGN KEY (customer_id) REFERENCES Customers (customer_id),
    FOREIGN KEY (acct_id) REFERENCES Accounts (acct_id)
);
Remembering the data from Table 3-1, we find the following data to insert into the Owns table:
INSERT INTO Owns (customer_id, acct_id) VALUES
    ('customer_0', 'acct_14'),
    ('customer_1', 'acct_14'),
    ('customer_2', 'acct_5'),
    ('customer_3', 'acct_0'),
    ('customer_4', 'acct_0');
Now that we have some data in our Owns table (see Figure 3-9), we can see how to associate the data from a customer and an account in our data.
The final step in creating our relational database is to create the Owes table to associate a customer to their loan, and vice versa. The final data model for this join table is shown in Figure 3-10.
In SQL, we create this table in our relational database via:
CREATE TABLE Owes (
    customer_id TEXT NOT NULL,
    loan_id TEXT NOT NULL,
    created_date DATE DEFAULT CURRENT_DATE,
    PRIMARY KEY (customer_id, loan_id),
    FOREIGN KEY (customer_id) REFERENCES Customers (customer_id),
    FOREIGN KEY (loan_id) REFERENCES Loans (loan_id)
);
Finally, we can extract one last connection from the data in Table 3-1 and insert all observations of customers who owe loans into the Owes table:
INSERT INTO Owes (customer_id, loan_id) VALUES
    ('customer_0', 'loan_32'),
    ('customer_3', 'loan_18'),
    ('customer_4', 'loan_18'),
    ('customer_4', 'loan_80');
The full picture of our data in our relational database is shown in Figure 3-11.
Now that the data is in our relational database, we need to ask the four fundamental queries for our C360 application:
Which credit cards does this customer use?
Which accounts does this customer own?
Which loans does this customer owe?
What do we know about this customer?
For our relational system, we are asking these four questions in a specific order for two reasons. First, we want to start slowly with a natural progression toward asking more detailed questions about a person from a database. Second, we structured these questions so that the technical implementation builds upon each statement to conclude with the final SQL statement.
First, let’s use our relational database to query for the credit cards owned by customer_0. The data for this query is directly available from the CreditCards table. If we just want the credit card information, we can query the table with the following SQL query:
SELECT * FROM CreditCards
WHERE customer_id = 'customer_0';
This query will return the following data:
| cc_num | customer_id | created_date |
|---|---|---|
| cc_17 | customer_0 | 2020-01-01 |
It is likely that you really wanted to view the customer’s data alongside their credit card information. This requires us to join the Customers table with the CreditCards table. In SQL, this would be done via:
SELECT Customers.customer_id, Customers.name,
       CreditCards.cc_num, CreditCards.created_date
FROM Customers
LEFT JOIN CreditCards
    ON (Customers.customer_id = CreditCards.customer_id)
WHERE Customers.customer_id = 'customer_0';
This query will return the following data:
| Customers.customer_id | Customers.name | CreditCards.cc_num | CreditCards.created_date |
|---|---|---|---|
| customer_0 | Michael | cc_17 | 2020-01-01 |
Getting access to a customer alongside their credit card information requires only one join statement because of the one-to-many relationship between customers and credit cards. When we need to look at the data about customers and their accounts, things get a little tricky.
Next, let’s query the relational database to answer the question: which accounts does customer_0 own? For this query, we will need to use the join table Owns to join together the customer table with the account table. The SQL query for this is:
SELECT Customers.customer_id, Customers.name,
       Accounts.acct_id, Accounts.created_date
FROM Customers
LEFT JOIN Owns
    ON (Customers.customer_id = Owns.customer_id)
LEFT JOIN Accounts
    ON (Accounts.acct_id = Owns.acct_id)
WHERE Customers.customer_id = 'customer_0';
This query starts by accessing the data about customer_0 from the customer table. Next, we find all foreign key pairs from the Owns table that have a matching customer_id. There is only one entry in the Owns table for this customer because customer_0 owns only one account. From here, we follow the foreign keys of the accounts over to the Accounts table to extract the account information. The resulting data looks like:
| Customers.customer_id | Customers.name | Accounts.acct_id | Accounts.created_date |
|---|---|---|---|
| customer_0 | Michael | acct_14 | 2020-01-01 |
The next question uses the same structure but traces from the customer table to the Loans table by using the Owes join table. This question is asking for the customer’s information alongside their loan details. For this query, let’s use the data about customer_4. The SQL statement for this query is:
SELECT Customers.customer_id, Customers.name,
       Loans.loan_id, Loans.created_date
FROM Customers
LEFT JOIN Owes
    ON (Customers.customer_id = Owes.customer_id)
LEFT JOIN Loans
    ON (Loans.loan_id = Owes.loan_id)
WHERE Customers.customer_id = 'customer_4';
The resulting data is:
| Customers.customer_id | Customers.name | Loans.loan_id | Loans.created_date |
|---|---|---|---|
| customer_4 | Aaliyah | loan_18 | 2020-01-01 |
| customer_4 | Aaliyah | loan_80 | 2020-01-01 |
Each of these queries is building up the individual pieces required to ask the main query for a C360 application: for a specific customer, tell me everything we know about them. This query brings together each of the three previous queries into one statement. The following SQL statement uses all six tables across our relational database to find all information about one customer. Let’s use customer_0 again in this final example:
SELECT Customers.customer_id, Customers.name,
       Accounts.acct_id, Accounts.created_date,
       Loans.loan_id, Loans.created_date,
       CreditCards.cc_num, CreditCards.created_date
FROM Customers
LEFT JOIN Owns
    ON (Customers.customer_id = Owns.customer_id)
LEFT JOIN Accounts
    ON (Accounts.acct_id = Owns.acct_id)
LEFT JOIN Owes
    ON (Customers.customer_id = Owes.customer_id)
LEFT JOIN Loans
    ON (Loans.loan_id = Owes.loan_id)
LEFT JOIN CreditCards
    ON (Customers.customer_id = CreditCards.customer_id)
WHERE Customers.customer_id = 'customer_0';
This will transform the data about customer_0 from across the database into the following:
| customer_id | name | acct_id | created_date | loan_id | created_date | cc_num | created_date |
|---|---|---|---|---|---|---|---|
| customer_0 | Michael | acct_14 | 2020-01-01 | loan_32 | 2020-01-01 | cc_17 | 2020-01-01 |
The four questions we demonstrated in this section are just scratching the surface of the SQL query language. And we addressed only the fundamentals of SQL: SELECT-FROM-WHERE with basic joins. Even though our questions can be stated very simply, the required queries become increasingly complex. It is even harder to follow the data throughout this system to understand which data is related to which customer.
Now that we have walked through a relational implementation, let’s dig into transforming our sample data into a graph database implementation. Let’s revisit the conceptual model, shown in Figure 3-12, before we dig into the implementation details in this section.
For this example, we are going to use the Gremlin query language—the most widely implemented graph query language—and DataStax Graph schema APIs. We are choosing to use Gremlin due to its wide adoption across the graph database community and its roots in open source. Our overarching objective in this book is to build up to implementing graphs within a distributed, partitioned environment. Given this goal, we will be using the DataStax Graph schema APIs to work with distributed graphs.
Compared to relational models, there is a smaller transition from a conceptual model to a graph data model. This lower bar illustrates the power of a database implementation that more closely represents your natural way of expressing data.
Using the GSL from “The Graph Schema Language”, Figure 3-13 contains a property graph model for our example data. The first benefit to notice is the shorter transition from conceptual (Figure 3-12) to logical data modeling for graph implementations.
There are four vertex labels in Figure 3-13: Customer, Account, CreditCard, and Loan. These vertex labels are shown in bold on the entities in the data model. There are three edge labels in Figure 3-13: owns, uses, and owes. These edge labels are shown in bold on the relationships in the data model.
Last, we find a few places in Figure 3-13 where we are using properties. A Customer vertex will have two properties: customer_id and name. You can see the properties for each vertex listed underneath the vertex labels. We have also included a role property on the owns edge label.
With graph databases, our first implementation step will be to create the graph so that we can add the graph’s schema. Once we have set up the schema, we will be ready to insert data into the database.
The code for creating a graph is as follows:
system.graph("simple_c360").create()
We took care of installing and setting up the graphs for you in the provided technical assets. If you want to dig into those steps on your own, you can find the step-by-step instructions in the DataStax Docs. We will not be covering those topics in this book.
Let’s dive straight into creating graph schema. If you would like, you can follow along in the DataStax Studio Notebook we created for this chapter, Ch3_SimpleC360. DataStax Studio gives you a notebook environment for developing with DataStax products and is the best way to implement this book’s examples. The notebooks are available in our book’s GitHub repository.
First, let’s create the customer vertex label. Our customer data has a unique ID and a name:
schema.vertexLabel("Customer").
       ifNotExists().
       partitionBy("customer_id", Text).
       property("name", Text).
       create();
Let’s finish the vertex label creation by adding the vertex labels for accounts, loans, and credit cards:
schema.vertexLabel("Account").
       ifNotExists().
       partitionBy("acct_id", Text).
       create();

schema.vertexLabel("Loan").
       ifNotExists().
       partitionBy("loan_id", Text).
       create();

schema.vertexLabel("CreditCard").
       ifNotExists().
       partitionBy("cc_num", Text).
       create();
At this point, we have four tables in the database—one table per vertex label. The last step is to add the relationships from the customer to each of the other entities in the data model.
For this example, we chose to model the edges coming out of the customer vertex and into the other vertex types. These edges have direction; each comes from the customer and goes to Accounts, Loans, and CreditCards. When we create an edge label, this direction matters. Let's look at an example for creating the owes relationship between a customer and their loan:
schema.edgeLabel("owes").
       ifNotExists().
       from("Customer").
       to("Loan").
       create();
The direction of this edge label is set with the from and to steps. The edge comes from the vertex label Customer and goes to the Loan vertex label.
There are two more edge labels to create: one from the customer to their credit card and another from the customer to their account. The owns edges will also have the role property stored on the edge:
schema.edgeLabel("uses").
       ifNotExists().
       from("Customer").
       to("CreditCard").
       create();

schema.edgeLabel("owns").
       ifNotExists().
       from("Customer").
       to("Account").
       property("role", Text).
       create();
We read this label as saying that the owns edge comes from the customer, goes to the account, and has a property called role.
With our graph schema in place, we can add our sample data into this graph database. We will start by adding one piece of data—the vertex for Michael:
michael = g.addV("Customer").
            property("customer_id", "customer_0").
            property("name", "Michael").
            next();
When adding vertices into your graph, the addV step requires you to provide the full primary key. Otherwise, you will see an error like the one shown in Figure 3-14.
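For instance, a hypothetical statement like the following omits the customer_id partition key and would be rejected:

// Hypothetical sketch: this addV call is missing the partition key
// (customer_id), so the database cannot construct the vertex's primary key.
g.addV("Customer").property("name", "Michael").next();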
Next, let’s add the vertices for Michael’s account, loan, and credit card:
acct_14 = g.addV("Account").property("acct_id", "acct_14").next();
loan_32 = g.addV("Loan").property("loan_id", "loan_32").next();
cc_17 = g.addV("CreditCard").property("cc_num", "cc_17").next();
The next() step is a terminal step in Gremlin. It returns the first result from the end of a traversal. In the preceding example, we are returning the vertex object that we just added into the graph and storing it into an in-memory variable.
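If you want every result rather than just the first, Gremlin provides other terminal steps. Here is a minimal sketch contrasting next() with toList(), another standard Gremlin terminal step:

// next() returns only the first result of the traversal:
first = g.V().hasLabel("Customer").next();

// toList() returns all of the traversal's results as a list:
all = g.V().hasLabel("Customer").toList();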
Now, we have four disconnected pieces of data in our graph database. As before, we stored each vertex object in variables called acct_14, loan_32, and cc_17 to be used later. The database essentially has four vertices with no edges, as seen in Figure 3-15.
Let’s introduce some connectivity between the data. To do so, we need to add three edges from customer_0 to the other vertices. Using the variables we just created, we can add an edge from the vertex Michael to the vertices account, loan, and creditCard, respectively:
g.addE("owns").from(michael).to(acct_14).property("role", "primary").next();
g.addE("owes").from(michael).to(loan_32).next();
g.addE("uses").from(michael).to(cc_17).next();
When adding edges into the database, we start by identifying the vertex from which the edge will be coming. In the preceding example, this is Michael because all edges will be starting from Michael and going to other pieces of data. These three edges create the first connected view of our data in our graph database, as shown in Figure 3-16.
From the example we already walked through, we know that Maria shares an account with Michael. Let’s add the vertex for Maria and connect it to the account vertex we already created (see Figure 3-17):
maria = g.addV("Customer").
          property("customer_id", "customer_1").
          property("name", "Maria").
          next();
g.addE("owns").from(maria).to(acct_14).property("role", "limited").next();
Let’s finish up this example by adding the vertices and edges about the remaining three customers:
// Data Insertion for Rashikarashika=g.addV("Customer").property("customer_id","customer_2").property("name","Rashika").next();acct_5=g.addV("Account").property("acct_id","acct_5").next();cc_32=g.addV("CreditCard").property("cc_num","cc_32").next();g.addE("owns").from(rashika).to(acct_5).property("role","primary").next();g.addE("uses").from(rashika).to(cc_32).next();
// Data Insertion for Rashikarashika=g.addV("Customer").property("customer_id","customer_2").property("name","Rashika").next();acct_5=g.addV("Account").property("acct_id","acct_5").next();cc_32=g.addV("CreditCard").property("cc_num","cc_32").next();g.addE("owns").from(rashika).to(acct_5).property("role","primary").next();g.addE("uses").from(rashika).to(cc_32).next();
// Data Insertion for Jamie
jamie = g.addV("Customer").property("customer_id", "customer_3").property("name", "Jamie").next();
acct_0 = g.addV("Account").property("acct_id", "acct_0").next();
loan_18 = g.addV("Loan").property("loan_id", "loan_18").next();
g.addE("owns").from(jamie).to(acct_0).property("role", "primary").next();
g.addE("owes").from(jamie).to(loan_18).next();
// Data Insertion for Aaliyah
aaliyah = g.addV("Customer").property("customer_id", "customer_4").property("name", "Aaliyah").next();
loan_80 = g.addV("Loan").property("loan_id", "loan_80").next();
g.addE("owns").from(aaliyah).to(acct_0).property("role", "primary").next();
g.addE("owes").from(aaliyah).to(loan_80).next();
g.addE("owes").from(aaliyah).to(loan_18).next();
These final statements complete the insertion of the sample data into our graph database. Figure 3-18 shows the final view of the data in the database.

The Gremlin statements in this section are our first graph database queries. A graph database query is also called a graph traversal.

A graph traversal is an iterative process of visiting the vertices and edges of a graph in a well-defined order.

When using Gremlin, you start your traversals with a traversal source.

A traversal source wraps two concepts together: the graph data you are traversing and traversal strategies, such as exploring data without indexes. The traversal sources you will use for the examples in this book are dev (for development) and g (for production).

The queries in this section used the g traversal source. We will come back to the g traversal source in Chapter 5 and in the production chapters.

For the rest of this chapter, we will be using the dev traversal source. We will use dev throughout this book whenever we are developing graph traversals, as in this chapter, Chapter 4, and the other development chapters. We use the dev traversal source because it allows us to explore our graph data without indexes on the data.

From here, let's move on to implementing the same queries as before, but with our graph data.
We like to think of querying a graph database, loosely, as the reverse of an SQL query. The common relational querying mindset is SELECT-FROM-WHERE. In a graph, we are essentially asking the traversal to follow a similar pattern in reverse: WHERE-JOIN-SELECT.

You can think of a Gremlin query as beginning with WHERE you need to start in your graph data. Then you tell the database to use relationships from your starting location to JOIN different pieces of data together. Last, you tell the database which data to SELECT and return. For a C360 application, our queries loosely follow this WHERE-JOIN-SELECT pattern and are a great starting point for learning how to query a graph database.
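To make the reversal concrete, the sketch below pairs a hypothetical relational version (the customers and customer_cards tables and their columns are invented stand-ins, not a schema from this book) with the Gremlin traversal we will build up step by step in the pages that follow:

// Relational mindset, SELECT-FROM-WHERE (hypothetical tables):
//   SELECT cc.cc_num
//   FROM customers c
//   JOIN customer_cards cc ON c.customer_id = cc.customer_id
//   WHERE c.customer_id = 'customer_0';

// Graph mindset, WHERE-JOIN-SELECT:
dev.V().has("Customer", "customer_id", "customer_0"). // WHERE: find the starting vertex
  out("uses").                                        // JOIN: walk the relationship
  values("cc_num")                                    // SELECT: return the data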
With that in mind, let's revisit our C360 application queries, and then we will answer each question using the Gremlin query language and our graph database:

Which credit cards does this customer use?

Which accounts does this customer own?

Which loans does this customer owe?

What do we know about this customer?

First, let's use our graph database to query for the credit cards used by customer_0. We can't just query for any credit card; we need to first access the vertex for customer_0 and then walk to the credit card that is adjacent (connected) to customer_0. In Gremlin:
dev.V().has("Customer", "customer_id", "customer_0"). // WHERE
  out("uses").                                        // JOIN
  values("cc_num")                                    // SELECT
The language to the right of // in each line of code is an in-line comment describing the logic in the code at left.

This query will return the following data:

"cc_17"
Let's break down this Gremlin query into the WHERE-JOIN-SELECT pattern. The first part of the query is dev.V().has("Customer", "customer_id", "customer_0"). We say that this step is finding where you are starting your graph traversal; we begin by finding a vertex with label Customer that has customer_id equal to customer_0. The second step in this traversal is out("uses"). This step joins the customer to their credit card data. The last step is to pick the data you want to return: the values("cc_num") step. This part of the Gremlin traversal specifies which data to select and return to the end user.
Whenever you see the word traversal, you can associate it with the idea of walking. To us, a graph traversal is a walk through your graph data. When we write graph traversals, we picture walking to and from pieces of graph data in our minds.

Let's go back to the graph query we just wrote to show you how we think of traversals as walking around graph data. In the first part of the graph query, we found a single vertex as our starting place: customer_0. From this customer, we needed to walk through the outgoing edge labeled uses. We walked through this edge using the out() step in Gremlin so that we could arrive at the credit card vertex. Once we were at the credit card vertex, we could look at the properties on the vertex. Specifically, we wanted access to Michael's credit card number: cc_17.
For the best performance, we advise that you always start your traversal from a specific vertex via the full primary key. For Apache Cassandra users, this is the same as providing the full primary key for a CQL query.
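To contrast the two, here is a small sketch; the second traversal is an anti-pattern we invented for illustration, not a query from this example:

// Preferred: anchor on the vertex label plus the full primary key
dev.V().has("Customer", "customer_id", "customer_0")

// Anti-pattern: no label and no key forces a scan across every vertex
dev.V().has("name", "Michael")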
It helps to have a copy of Figure 3-13 to look at as you start practicing your first graph traversals. From an image on paper, you can see where you need to start and end. This is just like using a map for navigation, but in this case, you are walking around your data. With graph data, you can use your graph model to find your starting place, find your ending place, and translate the walk between them into your Gremlin statements. With enough practice, you will eventually be able to do all of this in your head.
The next C360 query for our application wants to know which accounts a specific customer owns. Following the same pattern as before, we are going to access the vertex for customer_0 and walk to the account vertex. From the account vertex, we can access the unique ID for the account:
dev.V().has("Customer", "customer_id", "customer_0"). // WHERE
  out("owns").                                        // JOIN
  values("acct_id")                                   // SELECT
As before, this query follows the WHERE-JOIN-SELECT pattern. The first part of this Gremlin query is similar to a where statement: dev.V().has("Customer", "customer_id", "customer_0"). We say that this step is finding where you are starting your graph traversal.

The second step in this traversal is like a join statement: out("owns"). This step walks through the owns relationship coming out of the customer to join the customer to their data. The last step selects the data to return to the end user, specifically the account ID: values("acct_id"). This query will return the following data:

"acct_14"
Let's try the same query again, but this time we would like to display the customer's name alongside their account ID. To do this, we need to remember the data we have visited as we walked through the graph. This introduces two new Gremlin steps: as() and select(). The as() step is similar to labeling the data as you walk through your graph, like leaving breadcrumbs behind as you walk through a maze.

Once we are done, we can recall the visited data with the other new step: select(). We use the select() step to return the data from the query:
dev.V().has("Customer", "customer_id", "customer_0"). // WHERE
  as("customer").                                     // LABEL
  out("owns").                                        // JOIN
  as("account").                                      // LABEL
  select("customer", "account").                      // SELECT
    by(values("name")).                               // SELECT BY (for the customer)
    by(values("acct_id"))                             // SELECT BY (for the account)
As before, this query follows the same WHERE-JOIN-SELECT pattern, with two additions: the need to SAVE and then SELECT specific data points from the query.

Let's walk through the steps in this query.

Once again, we start with where we need to go in our graph data: dev.V().has("Customer", "customer_id", "customer_0"). We want to remember this data for later, so we save it with the step as("customer"). We continue to follow the pattern as before, joining the customer to their account data by walking through the owns edge. Now we have arrived at the account vertex. We want to save this vertex by using as(), like before. Last, we need to select multiple pieces of data. We do this with select("customer", "account").

The remaining two steps, which use by, are important to call out. These steps help us shape the results of our query. After the select("customer", "account") step, we have two vertex objects: the customer and account vertices, respectively. Our original query wanted to access the customer's name and the account ID. That is where the by step comes in. We want to view the customer according to their name and the account according to its ID. The by steps are applied in order to the vertex objects.

This query returns the following JSON:
{"customer":"Michael","account":"acct_14"}
{"customer":"Michael","account":"acct_14"}
So far, we have seen three graph traversals and two different ways to select data from your graph. Next, let's explore the third query for our C360 application. This query accesses the loans associated with a customer. Let's use customer_4 for this example, since she has multiple loans in our dataset. In this query, we just want to look at the loan IDs:
dev.V().has("Customer", "customer_id", "customer_4"). // WHERE
  out("owes").                                        // JOIN
  values("loan_id")                                   // SELECT
This query follows the same WHERE-JOIN-SELECT pattern that we saw in the previous section. This query will return the following data:

"loan_18", "loan_80"
The final query of a C360 application accesses an individual customer and all of their relevant data. The query will start at customer_0 and walk through all outgoing edges connected to customer_0. Then we return the data from all vertices in this first neighborhood of customer_0. This query gives us all of the data about customer_0:
dev.V().has("Customer", "customer_id", "customer_0"). // WHERE
  out().                                              // JOIN
  elementMap()                                        // SELECT *
This query will return the data shown in Example 3-1.

{"id": "dseg:/CreditCard/cc_17", "label": "CreditCard", "cc_num": "cc_17"},
{"id": "dseg:/Loan/loan_32", "label": "Loan", "loan_id": "loan_32"},
{"id": "dseg:/Account/acct_14", "label": "Account", "acct_id": "acct_14"}
Example 3-1 shows everything stored in DataStax Graph about each vertex: an internal id, the vertex's label, and then all properties. Let's inspect the JSON that describes Michael's credit card. First, there is "id": "dseg:/CreditCard/cc_17". This is the internal identifier used in DataStax Graph to describe that piece of data. The internal id in DataStax Graph is a URI, or Uniform Resource Identifier. Next, we see the vertex's label, "label": "CreditCard". Last, we see the only property we stored in the graph about credit cards: "cc_num": "cc_17". We interpret the JSON for the loan and account vertices similarly.
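If you want the SELECT * behavior but only for specific properties, elementMap() also accepts property keys; here is a small sketch against the same data:

dev.V().has("Customer", "customer_id", "customer_0").
  out().
  elementMap("cc_num", "loan_id", "acct_id") // id and label are always included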
These traversals are the base of what is required to extract the data for your C360 application. We recommend keeping a copy of your graph data model nearby when you first start writing graph traversals. Once you understand the basic steps, you can use an image of your data model to walk from your starting point to your destination. After some practice, this is an art you may be able to visualize in your head, as if you were the one walking around the data.

We constructed this example to show that graph applications can make data retrieval easier. As seen in this section, the queries had significantly fewer steps and were easier to follow. The adjustment from relational to graph query languages requires an adjustment in your mindset toward traversing, or walking through, your data. The learning curve is steep; we don't want to hide that. However, once you can picture yourself walking through your graph data, writing graph queries can be as simple as learning a new set of tools.
Through the lens of a C360 application, let's consider the benefits and drawbacks of a relational database implementation versus those of a graph database implementation. We are going to compare the two technologies in three areas: each approach's take on data modeling, on representing relationships, and on query languages.
There are quantitative and subjective items to consider when comparing the differences in data modeling for relational and graph databases. The quantitative arguments around data model design point toward a relational system as the clear winner, owing to the higher volume of resources for, and production usage of, relational systems. The techniques, tricks, and optimizations for a relational system are very well documented and accessible to members of a development team at every skill level.

On a more subjective note, data modeling with graph technologies is significantly more intuitive. Specifically, when using graph technology, the human-to-computer translation of data is preserved; the way you think about your data is nearly the same as the way you represent it digitally in a computer. This shorter translation from human intuition to machine representation allows you to extract deeper insights about the relationships in your data. This arguably makes graph technology easier to use than the system design required for the same implementation in a relational database.

There has been, and continues to be, a growing demand for modeling and storing relationships within a database. This has created both good and bad news for relational systems. The good news, as stated before, is that the tips, tricks, and techniques for modeling relationships with relational technology are well documented. Adding relationships to an existing relational database can be as simple as adding a join table or a foreign key constraint. With the new join table or foreign key, relationships are queryable and accessible. Essentially, getting the data into the system is well documented and relatively easy for the developer.

On the downside, getting the relationships back out of a relational system in a meaningful way has a much steeper curve; it is very difficult to reason about the relationships stored in a relational system because of the large gap from idea to implementation to machine. The process from conversation to modeling to reasoning is much more disconnected with relational technology than it is with graph technology. The disconnect lies in the mental transformation required to map your human understanding of data into relational models and down into tables. This translation demands significant mental interpretation to follow and reason about relationships within the data stored in a relational database.

Graph technologies were created from this gap. If you need to model and reason about relationships in your data, graph technologies provide a more seamless transition from human understanding to machine representation of your data and back. The crux of this stance is whether relationships exist within your data and are useful for deeper analytics and reasoning. If you need to model and reason about the relationships in your data, then graph technologies are the way to go.
There are three aspects we would like to examine when comparing the query languages of the two systems: language complexity, query performance, and expressiveness.

First, let's talk about what we mean by language complexity. After relationships have been designed into your system, the query language introduces additional complexity to your evaluation of the database within your architecture. At this level, it is the query language that surfaces all of the complexities or simplifications made during the implementation process. That additional complexity is experienced as queries are developed and lengthened to pull together the required data.

Teams often measure query language complexity by query development time, maintainability, and ease of knowledge transfer. When you are considering SQL and Gremlin, these comparisons come down to adoption maturity and personal preference. SQL is the clear winner in language maturity. However, we see the scales tip toward Gremlin for deeply nested queries or those requiring a large number of joins.

The next evaluation of query languages measures query performance. Query performance depends on a multifaceted and complex web of database-tuning exercises: indexing, partitioning, load balancing, and more optimizations than will fit in this book.

When we consider the scope of a C360 application in a small deployment, queries against a properly indexed relational system will likely outperform the same queries in a graph database consistently. This is because the queries for the simplistic C360 application are very shallow graph queries; they stay within the first neighborhood of the customer. As graph queries get deeper, as we will see in the next chapter, the performance debate between graph technology and relational technology heavily favors graph solutions.

The last comparison considers query language expressiveness. In our experience, the expressiveness of graph query languages solidifies the power of using graph data in an application. The difference in query complexity between the two systems illustrates that a more expressive language like Gremlin is a significant improvement for querying relationships in your system. Graph query languages on top of graph databases allow for a significant reduction in the code required to access and extract relationships from data. Only time will allow graph technologies to mature to the same levels as relational standards.
For a loose summation of the points we can make for each option, see Table 3-2.
|  | Relational | Graph |
|---|---|---|
| Data modeling | Well documented | Digital representation matches human interpretation |
| Representing relationships in data | Known limitations and complexities | More intuitive representation |
| Queries | Well documented; difficulty when querying many relationships together | Steep learning curve; more expressive query language |
For any area in which you can compare these two technologies, the advantages and disadvantages of either choice come down to maturity. The adoption, documentation, and community are much more evolved for relational technologies than for graph technologies. This maturity likely translates to lower risk and faster execution for traditional applications. Today, graph technologies cannot compete with relational in the categories of maturity and time to delivery for a new application.

On the other hand, relational technology is reaching its limits for delivering valuable insights into relationships within data. This is a significant problem because relationships occur naturally within data and are instrumental in delivering improved insights into your business. In this regard, graph technology is the better option for applications that require relationships to make business decisions. It is the best choice for delivering and reasoning about relationships within your data, which is not achievable at depth and scale with relational databases.

The power and vision of implementing a C360 application with graph technology is directly correlated with your business's need to access related data across your organization.

Let's unpack what we mean by that.

We have consulted with many enterprises whose specific technology choices over the past decade led, in turn, to the construction of data silos. These data silos separated the data relevant to the core entities of their business, such as the customer. From there, recent approaches led to the integration of important data into large monolithic systems, such as data lakes. The pain points here were not in the integration of the company's data but in its accessibility.

Who wants to spend time and resources fishing for valuable data in a data lake instead of using a system designed to retrieve valuable data?

For these enterprises, the advent of graph thinking has guided the next iteration of their data architecture. Their goal is to build with technologies that make their data available and representative of their customers' experience. This combination of availability and representation has been, and continues to be, the driving momentum behind graph technology.
Graph technologies are enabling the next iteration of enterprise data architectures in a way that was previously unachievable. We delved into one version of graph data management in this chapter. Namely, we explored the application and implementation details of a Customer 360 application, a customer-specific use case for graph technology. However, this same template for building data-centric applications with graph technology applies to non-customer-facing applications.

We have seen companies build similar systems around the businesses they interact with, kind of like a Business 360. The applications that organize and deliver all information about important interactions within their business save significant overhead in cross-departmental communication. For example, imagine all of the different departments you would have to collaborate with to find the most recent interaction between your company and another vendor. The information for this request is spread across finance, marketing, sales, customer relations, and likely other departments. The solution to this B2B problem requires the same template we have described throughout this chapter.
Given the vision for this style of application, the next criteria to evaluate are the time and cost of implementation. These choices likely involve comparing vendors and existing tools, such as using relational or graph technology for your C360 application.

We get this question all the time: "I can build a C360 application with an RDBMS, so why not use what I already know?"

The short version of our response: relational is great for tabular data; graph is better for complex data. Otherwise, the two are remarkably similar. At its root, your choice comes down to the complexity of your data and what value you want to get out of it.
In the longer version, the key is how your business values time: time spent engineering custom solutions and time spent waiting on queries. The differences are quite clear when your business needs to answer deeper or unplanned queries. Relational systems require architectural changes, added tables, and custom-built query logic. Graph systems require augmenting your schema and inserting more data.
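To illustrate that difference in effort, answering a new, unplanned relationship question on the graph side can be as small as one insertion. In this sketch, the refers edge label is our own invention for illustration:

// A new, unplanned relationship: "Michael referred Maria" -- no table migration needed
g.addE("refers").from(michael).to(maria).next()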
Essentially, graph technology makes it easier to work with complex data, whereas relational technology is easier for simple (i.e., tabular) data. The depth and complexity your project needs to grow into will help make this choice clearer for you.
The decision between relational and graph technology ultimately comes down to the full scope of your C360 application. Generally speaking, our experience has shown that if your application aims only to unify disparate data sources, you will achieve the best results from properly tuning a relational system. This realization, and a commitment to that sole function of your application, will save development resources and get you to final delivery of the production system more quickly.

On the other hand, if a data management solution or C360 application is a starting point for your data architecture, then the steep learning curve of graph databases will deliver more value in the long run. Graph technologies enable more intuitive reasoning about the relationships that exist across your data. Business objectives that require insight into relationships also require graph technology behind them.

Let us be clear on our points here. The example in this chapter is incredibly primitive. Anything more realistic and more elaborate starts to become a stretch for an RDBMS. And realistic data contains elaborate relationships within it. If your business needs access to those relationships, then you need graph technology.

If you are going to build just a simple C360 system and nothing more, use relational technology. If you want to understand and explore the connectedness within your data, use graph technology. There are pluses and minuses to each choice, but for the scenario we have set up in this chapter, graph technology is the winner.

Whichever data problem your business faces, be aware that teams needing to build on and extend a foundation are turning to graph technology. A successful integration of graph technology into your architecture starts with a C360 application as its foundation and builds from there. With a C360 application as a foundation, your business is set up to go after deeper graph traversals for more valuable insights from your data. In the next chapter, we will extend our simplistic C360 application into a more complete scenario that highlights how graph technology and RDBMSs diverge in terms of ease of use and time to market.
1 Mark Abraham, et al. "Profiting from Personalization." Boston Consulting Group, May 8, 2017. https://www.bcg.com/publications/2017/retail-marketing-sales-profiting-personalization.aspx.
To get to the next phase of graph application development, we are going to build upon the simple Customer 360 (C360) application from Chapter 3. We'll add a few more layers, or neighborhoods, onto that example to illustrate the next wave of concepts in graph thinking.

Adding data to our example provides a more realistic picture of the complexity of data modeling, querying, and applying graph thinking to our customer-centric financial data.

We consider the transition from the basic example in Chapter 3 to the complexity of this chapter analogous to the steps of learning how to scuba dive. What we did in Chapter 3 was like learning to scuba dive in a wading pool; it is not really clear what the point is when you are in water that shallow. But we needed to start from a familiar place. The examples in this chapter are like scuba diving in a deep pool. Afterwards, we will be ready to head into more interesting depths in Chapter 5.

There are three main sections within this chapter.

In the first section, we will explore and explain graph thinking to present best practices in graph data modeling. We will do this by adding more neighborhoods of data to our C360 example so that we can answer the following questions:
What are the most recent 20 transactions involving Michael's account?

In December, at which vendors did Michael shop, and with what frequency?

Find and update the transactions that Jamie and Aaliyah most value: their payments from their account to their mortgage loan. (Query 3 is an example of personalization.)

Throughout this initial section, we will follow query-driven design to illustrate common best practices for creating a property graph data model. Topics include mapping your data to vertices or edges, modeling time, and common mistakes.
In the next section, we will build up deeper Gremlin queries. These queries walk through three, four, and five neighborhoods of data. We will also introduce how to use properties to slice, order, and range over graph data, and we will discuss querying in time windows. By the end of this section, we will have illustrated all of the data, technical concepts, and data modeling that we planned for our example.

We will end the chapter by revisiting the basic queries to introduce some more advanced querying techniques. These techniques are most commonly part of formatting your query results into a more user-friendly structure.

This content sets us up to present the final, production-quality schema for this example, which we will do in Chapter 5.
During the early days of working with graph databases backed by Apache Cassandra, my team was sitting around the couches in the living room of our venture-backed startup. We were whiteboarding a graph data model for storing healthcare data in a graph database.

We quickly agreed that doctors, patients, and hospitals were our primary entities of importance, and therefore they would be vertices. Everything after that was a debate. Vertices, edges, properties, and names: everyone had a defensible opinion about everything. Our most memorable disagreements were polarizing. What should we name the edges between doctors and patients? All of these entities live or work somewhere; how do we model addresses? Is country a vertex or a property, or should it be left out of our model?

It was a difficult conversation. It took much longer than we had expected to arrive at a design consensus, and none of us really felt comfortable with it.

Since that design session, every time I have advised a graph team around the world, I have felt similar tensions and seen similar paths to design consensus. The tensions are always real, always present, and always observable.
This section is all about helping your team have a more constructive discussion about your graph data model. To accomplish this, we will walk through three sections of advice for creating a good graph data model:

Should this be a vertex or an edge?

Lost yet? Walk me through direction.

A graph has no name: common mistakes in naming

We selected these topics for two reasons. First, they cover most of the points of contention you will encounter during the modeling process. Second, they support where we are in the development of the running example for these chapters. Deeper and more advanced modeling advice will be introduced when we get there.
This is the most debated topic in property graph modeling. From the middle of the most heated debates, we have gathered a number of tips for creating graph data models.

Let's start our tips at the beginning. In our world, the beginning is where you want to start your graph traversals.

If you want to start your traversal on some piece of data, make that data a vertex.
To unpack our first tip, let's revisit one of the queries we constructed in Chapter 3:

Which accounts does this customer own?

Three pieces of data are required to answer that question: customers, accounts, and a connection establishing which customer owns an account. Think about how you could use that data to "find all accounts owned by Michael." There are two ways to translate this statement into a database query: "Michael owns accounts" or "accounts owned by Michael."

Let's talk about the first option: starting with Michael to find his accounts. This means that you are starting with data about people—specifically, the piece of data about Michael. In your head, when you find a starting place for a query, you want to translate that data into a vertex label in your graph model. With this, we have our first vertex label for our graph model: customers.

Consider the second way to find this information: you could first find all accounts and then keep only those that are owned by Michael. In this case, you are starting with the data about accounts. Now we have a second vertex label for our graph model: accounts.
This sets us up for the next tip on how to find the edges in your data.
If you need the data to connect concepts, make that data an edge.

For the query we are working with, we know that Michael will be a vertex label and that his account is another vertex label. That leaves the concept of ownership, and yes, you guessed it: it will be the edge. The concept of ownership links a customer to an account in our example data.

To find the edges in your model, examine your data. Your edges come from the information that links concepts together and to which you have access.

When working with graph data, these edges are the most important piece of your graph model. Edges are why you need graph technology in the first place.

Putting these two tips together, you can derive the following rule for labeled property graph models.

Vertex-Edge-Vertex should read like a sentence or phrase from your queries.
Our advice here is to write out how you want to query your data in short phrases like "customer owns account." Identifying these queries and phrases remains a simple way to determine how to map your data onto graph objects in a property graph database, as shown in Figure 4-1.

Figure 4-1. An example of translating a noun-verb-noun phrase into a property graph model: Michael owns account acct_14, with an edge (relationship) titled owns

Generally speaking, written forms of your graph queries will translate verbs to edges and nouns to vertices.
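Applied to that phrase, the translation into insertion code is nearly mechanical; here is a sketch reusing the insertion style from Chapter 3:

// "customer owns account": nouns become vertices, the verb becomes the edge
customer = g.addV("Customer").property("customer_id", "customer_0").next()
account = g.addV("Account").property("acct_id", "acct_14").next()
g.addE("owns").from(customer).to(account).next()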
This isn't the first time the graph community has worked with semantic phrases and graph data. Those of you from the semantic community are likely shouting, "We've seen this before!" And you are right; we have.1

Putting recommendations #2 and #3 together yields a specific way to translate how you think into graph objects.

Nouns and concepts should be vertex labels. Verbs should be edge labels.

Depending on how you think, there are times when tip #3 and tip #4 can create ambiguous scenarios. We want to delve into some semantics here to help you navigate the different ways that people see and think about data.
Specifically, if you think "Michael owns an account," then "owns" should be an edge label. In this case, you are thinking actively about the relationship between Michael and his account, and this active line of thought makes owns a verb that connects two pieces of data together. This is how we arrive at "owns" as an edge label.

However, there are cases in which you may see this same scenario differently. Namely, if you are thinking "We need to represent the concept of ownership between Michael and his account," then ownership should be a vertex label. In that case, you are thinking of ownership as a noun, that is, an entity. The difference is that in this case, the ownership likely needs to be identifiable. You are probably trying to relate ownership in other ways. In these cases, other questions you may plan on asking are, "Who established that ownership?" or "Who does the ownership transfer to if the primary agent dies?"

We acknowledge that we are getting into the weeds here. But we know that you will eventually find yourself in the weeds as well. We hope that the guidance we provide will help you find your way back up and out.

Our first four tips introduced the fundamentals of identifying vertices and edges in your graph data. Let's walk through how to reason about the direction of your edge labels.
The questions and queries for this chapter integrate more data into our model. Specifically, we want to add transactions into our data so that we can answer questions like:

What are the most recent 20 transactions involving Michael's account?

To answer this query, we need to add transactions into our data model. And these transactions need to give us a way to model and reason about how money is withdrawn from and deposited to the accounts, loans, and credit cards.
When you first start writing graph queries and iterating on data models, it is very easy to get turned around in your data model. The direction of an edge label is a difficult thing to reason about, which is why we make the following recommendation.

When in development, let the direction of your edges reflect how you think about the data in your domain.

Tip #5 infers the direction of an edge label as you combine and apply the advice from the previous four tips. At this point, the Vertex-Edge-Vertex pattern should read easily as a subject-verb-object sentence.

Therefore, the edge label's direction comes from the subject and goes to the object.
Coming up with edge labels between transactions is a discussion we have seen play out many times. Let's follow our thought process to detail how we reasoned about modeling something like a transaction in a graph.

Think about how you would first add transactions into your graph model. You are likely thinking about how an account transacts with other accounts, something like what we show in Figure 4-2.

The model in Figure 4-2 doesn't work for our example because it uses the idea of a transaction as a verb, whereas our questions use transactions as nouns. We want to know things like an account's most recent transactions and which transactions are loan payments. In this light, we are really thinking about transactions as nouns.
Therefore, transactions need to be vertex labels in our example.
Now we need to reason about the direction of the edges. Most people start by modeling edge direction to follow the flow of money, as shown in Figure 4-3.

The challenge with a model like Figure 4-3 is coming up with intuitive names for the edges that make it easy to answer our chapter's questions. The edge direction in Figure 4-3 models the flow of money and is awkward for how we are using transactions in our questions. Would we say, "This account had money withdrawn from it via this transaction"? Let's hope not.

So Figure 4-3 isn't going to work for our example, either.
Let's recall our chapter's questions and reason about how we use transactions in the queries. We came up with the following subject-verb-object sentences for the context in which we are using transactions in our example:

Transactions withdraw from accounts.

Transactions deposit to accounts.

These two phrases might work; let's see how they play out with data. In our data, we could model a transaction and how it interacted with accounts as shown in Figure 4-4.

For the example in this chapter, we think Figure 4-4 makes it reasonably easy to use our model to answer our questions. This gives us direction for both of our edge labels: they will flow from a Transaction vertex and go to an Account vertex. The schema is shown in Figure 4-5.
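In insertion code, this decision reads exactly like the phrases above. In the following sketch, tx_1 and the reuse of the acct_14 and acct_0 variables from Chapter 3 are illustrative:

// "Transaction tx_1 withdraws from acct_14 and deposits to acct_0"
tx_1 = g.addV("Transaction").property("transaction_id", "tx_1").next()
g.addE("withdraw_from").from(tx_1).to(acct_14).next()
g.addE("deposit_to").from(tx_1).to(acct_0).next()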
By breaking down your queries into short, active phrases with a subject-verb-object structure, you will naturally find what needs to be a vertex or edge label in your graph model. The edge label's direction will then come from the subject and go to the object.

Let's zoom out from the nuances of modeling direction for transactions and get back to the final main element of a graph's schema: properties.
Let’s repeat the first query that will use the transaction vertices:
What are the most recent 20 transactions involving Michael's account?

The short version of our query from above translates into the following short phrases:

Michael owns account

Transactions withdraw from his account

Select the most recent 20 transactions
So far, we can walk through customers, accounts, and transactions within our graph. Now our question asks for the 20 most recent transactions on an account. This means that we need to subselect our transactions to include only the most recent ones.

Therefore, we want the ability to filter transactions by time. This brings us to our last tip related to data modeling decisions.

If you need to use data to subselect a group, make it a property.

Ordering transactions by time requires that value to be stored in our graph model: enter properties. This is a great use of a property on the transaction vertex so that we can subselect those vertices in our model. Figure 4-6 shows how we would add time into our ongoing example.
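With a timestamp property in place, the chapter's first question becomes answerable. Below is one possible sketch, assuming timestamps are stored in a sortable form, using TinkerPop's desc ordering, and treating the transaction_id property name as an assumption:

dev.V().has("Account", "acct_id", "acct_14"). // WHERE: Michael's account
  in("withdraw_from", "deposit_to").          // JOIN: incoming transaction edges
  order().by("timestamp", desc).              // most recent transactions first
  limit(20).                                  // keep only the top 20
  values("transaction_id")                    // SELECT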
Together, tips #1–6 give you a great starting point for identifying what will be a vertex, an edge, or a property in your graph data model. We have one last section of data modeling best practices to consider before we start the implementation details for this chapter.

The callouts in the upcoming section are common mistakes. Each mistake is followed by our bad-better-best recommendations.

Arriving at a consensus on what something should be named and how it should be maintained in your codebase is surprisingly difficult. There are three topics on which teams commonly waste their valuable time bikeshedding how to address naming conventions in their graph data model.
Using the word has as an edge label.

One of the most common mistakes we see comes from naming all of your edges with the label has, as shown on the left side of Figure 4-7. This is a naming mistake because the word has provides no meaningful context about the edge's purpose or direction.

If your graph model uses has for its edge labels, we have two recommendations for you. A better edge label would have the form has_{vertex_label}, as shown in orange in the center of Figure 4-7. This type of name allows you more specificity in your graph queries while also providing a more meaningful name to maintain in your codebase.

The preferred solution to this problem is shown in green at far right in Figure 4-7. This recommendation advises you to use an active verb that communicates meaning, direction, and specificity about your data. We are going to use the edge labels deposit_to and withdraw_from to connect transactions to accounts in our examples.
After meaningful edge labels have been selected, it is also a common mistake to create property names that do not help uniquely identify your data. This brings us to our next pitfall in property graph modeling.

Using the word id as a property.

The question of which pieces of data uniquely identify an entity is a deep topic. Using a property key called id is a bad decision because it does not describe what it refers to. Additionally, id clashes with the internal naming conventions within Apache Cassandra and is not supported in DataStax Graph.

A slightly better convention would be to name the property that uniquely identifies your data {vertex_label}_id, as shown at center in Figure 4-8. We use this a few times throughout the book because we are working with synthetic examples, and this type of identifier is perfectly fine if you use randomly generated identifiers, like UUIDs (universally unique identifiers). However, you will see us move to more descriptive identifiers when we work with open source data. These identifiers represent concepts that uniquely identify entities within their domain, such as social security numbers, public keys, and domain-specific universally unique identifiers.
This brings us to the last and debatably most important mistake that we see throughout application codebases.
Inconsistent use of casing.
When it comes to casing, the best approach follows the conventions of the language you are writing in. Some languages have style guides that promote CamelCase, whereas others prefer snake_case. For the examples in this book, we plan to follow the casing and styles below (see the short sketch after this list):
Capital CamelCase for vertex labels
Lowercase snake_case for edge labels, property keys, and example data
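Applied to the schema API used in this chapter, those two conventions look like the following sketch; the credit_card_id key is assumed here for illustration:

schema.vertexLabel("CreditCard").           // vertex label: capital CamelCase
       ifNotExists().
       partitionBy("credit_card_id", Int).  // property key: snake_case
       create();
schema.edgeLabel("withdraw_from").          // edge label: snake_case
       ifNotExists().
       from("Transaction").to("Account").
       create();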
This last tip feels a bit pedantic to even bring up in a graph book. We are mentioning it because consistency in naming conventions tends to be forgotten, creating expensive roadblocks for teams during the last stretch of getting their graph technology into production. The more trivial these tips seem to your team, the better off you probably already are in making sure to remember them.
The previous discussion of graph data modeling illustrated how we broke down our first query to evolve the example from Chapter 3. In this section, we want to build up the remaining elements in our data model to answer all the questions for this chapter's example.
The example in this chapter adds schema and data that enable our application to answer the following three questions:
What are the most recent 20 transactions involving Michael's account?
In December, at which vendors did Michael shop, and with what frequency?
Find and update the transactions that Jamie and Aaliyah most value: their payments from their account to their mortgage loan.
We have already stepped through how to model the first question. Let's take a closer look at it.
The graph schema in Figure 4-9 applies the principles we built up to answer the first question to a graph data model. The new vertex label is Transaction, with two new edge labels to the Account vertex: withdraw_from and deposit_to, respectively. We also discussed how and where to model time in our graph, which you see in Figure 4-9 with timestamp on the Transaction vertex.
Next, let's consider this chapter's remaining questions by modeling the queries:
1. In December, at which vendors did Michael shop, and with what frequency?
2. Find and update the transactions that Jamie and Aaliyah most value: their payments from their account to their mortgage loan.
To arrive at a data model for these questions, let's apply the thought processes we introduced in “Graph Data Modeling 101”. Following the advice there, we came up with three statements about transactions:
Transactions charge credit cards.
Transactions pay vendors.
Transactions pay loans.
From these statements, we can find the rest of our required schema elements. First, we need a new vertex label to represent where our customers shop: Vendor. Next, we need an edge label, pay, for a transaction to the Loan or Vendor vertex labels. Last, we need another edge label, charge, to indicate that a transaction charges a credit card.
Bringing all of this together, we have the schema shown in Figure 4-10.
We reduced the full perspective on graph data modeling to include only the practices that we need for our current example. Beyond these core principles, you will find edge cases about your data that are not covered here. That is expected. We are teaching a thought process and selected the principles here as a starting guide for modeling your data like a graph.
If we could ensure you understood one concept about graph data modeling, it would be the following: modeling your data as a graph is just as much of an art as it is engineering. The art of the data modeling process involves creating and evolving your perspective on your data. This evolution translates your mindset into the paradigm of relationship-first data modeling.
When you find new modeling cases in this book or in your own work, ask the following questions about what you are modeling to help develop your own reasoning:
What does this concept mean to the end user of the application?
How are you going to read this data in your application?
Defining your data model is the first step in applying graph thinking to your application. Focus on the data you can integrate, the queries you want to ask, and what this will mean to your end user. When combined, those three concepts articulate how we see, model, and use graph data within an application.
To help you learn and apply our perspective to building your own graph model, let's walk through the importance of data, queries, and the end user.
Our first piece of advice is to focus on the data you have. It is easy to boil the ocean by modeling your industry's entire graph problem; avoid this rabbit hole! Your graph model will evolve if you keep centered on getting to production with the data with which your application will be working.
Second, apply the practice of query-driven design. Build your data model to accommodate only a predefined set of graph queries. A common red herring we run into on this topic is applications that aim to create open traversals across any discoverable data in a graph. For developmental purposes, the ability to explore and discover makes sense. However, for production use, an application with open traversal access can introduce a myriad of concerns.
Because of the security, performance, and maintenance implications, we strongly advise teams not to create production platforms with unbounded and unlimited traversals. The warning sign we watch for is a lack of specificity in your graph application. We know this perspective is very hard to apply when you are first exploring graph data. We see the line here as setting expectations between what you want to do during development versus what you want to push to production in a distributed application.
Last and most importantly, you have to consider what the data means to your end user. Everything from the naming conventions you select to the objects in your graph will be interpreted by someone else: your team members or your application users. Naming conventions and graph objects are interpreted and maintained by your engineering team members; choose them wisely.
Ultimately, your graph data will be presented to an end user through your application. Spend time designing your data architecture, models, and queries to present information that is most meaningful to them.
When combined, these three concepts articulate how we see, model, and use graph data within an application. Again, the three concepts are to build with the data you have, follow query-driven design, and design for your end user. Following these design principles will help get you unstuck during those difficult data modeling discussions and prepare your application to be the best use of graph data the industry has ever seen.
Our schema from Figure 4-10 requires only two new vertex labels: Transaction and Vendor. By now, you have practiced a few times how to take a schema drawing and translate it into code. We showed the schema in Figure 4-10, and in Example 4-1 we show you the code.
schema.vertexLabel("Transaction").ifNotExists().partitionBy("transaction_id",Int).property("transaction_type",Text).property("timestamp",Text).create();schema.vertexLabel("Vendor").ifNotExists().partitionBy("vendor_id",Int).property("vendor_name",Text).create();
schema.vertexLabel("Transaction").ifNotExists().partitionBy("transaction_id",Int).property("transaction_type",Text).property("timestamp",Text).create();schema.vertexLabel("Vendor").ifNotExists().partitionBy("vendor_id",Int).property("vendor_name",Text).create();
In case you are wondering, we are using Text as the data type for timestamp to make it easier to teach concepts in our upcoming examples. We will be using the ISO 8601 standard format stored as text.
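For instance, inserting a transaction with a text timestamp might look like the following sketch; the identifier and instant are made up for illustration, and we assume the dev traversal source used throughout this chapter:

dev.addV("Transaction").
    property("transaction_id", 184).
    property("transaction_type", "unknown").
    property("timestamp", "2020-12-14T09:30:00Z")  // ISO 8601, stored as Text

A convenient side effect of ISO 8601 is that its strings sort lexicographically in chronological order, so the text comparisons and range predicates used later in this chapter behave as expected.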
In addition to these vertex labels, we added relationships between the Transaction vertex and the other vertex labels in this graph. Let's start with the new edge labels between the Transaction and Account vertex labels. The schema code for the new edge labels is shown in Example 4-2.
schema.edgeLabel("withdraw_from").ifNotExists().from("Transaction").to("Account").create();schema.edgeLabel("deposit_to").ifNotExists().from("Transaction").to("Account").create();
schema.edgeLabel("withdraw_from").ifNotExists().from("Transaction").to("Account").create();schema.edgeLabel("deposit_to").ifNotExists().from("Transaction").to("Account").create();
These two edges model how money moves to and from an account within your bank. In Example 4-3, we add in the rest of the edge labels in our example:
schema.edgeLabel("pay").ifNotExists().from("Transaction").to("Loan").create();schema.edgeLabel("charge").ifNotExists().from("Transaction").to("CreditCard").create();schema.edgeLabel("pay").ifNotExists().from("Transaction").to("Vendor").create();
schema.edgeLabel("pay").ifNotExists().from("Transaction").to("Loan").create();schema.edgeLabel("charge").ifNotExists().from("Transaction").to("CreditCard").create();schema.edgeLabel("pay").ifNotExists().from("Transaction").to("Vendor").create();
These last three edge labels complete the edges we will need to describe transactions between the assets in our example.
As examples grow, so too does the data. We wrote a small data generator to expand the data from Chapter 3 to include our data model from Figure 4-10. If you are interested in the data generation process for this chapter, you have two options.
Your first option is to use the bash scripts to reload the exact same data you will see in the upcoming examples. We will teach you about this tool and process in Chapter 5, but you are welcome to preview the loading script in the GitHub repository. We recommend using the scripts throughout this book if you would like the examples you are running locally to match the results we show in the text.
Your second option is to dive into and execute our data generation code. We provided our code in a separate Studio Notebook called Ch4_DataGeneration. We recommend this option if you want to dig into creating fake data with Gremlin and the methods we used.
If you rerun the data insertion process in your Studio Notebook, the results in your local graph will not precisely match the results printed in this text. If you want the data to match precisely, we recommend importing the exact same graph structure via the DataStax Bulk Loader. You will find all of this in the accompanying technical materials.
Up to this point, we have accomplished many tasks. We explored our first set of data modeling tips, created a development model, looked at the schema code, and inserted data.
The last main task is to use the Gremlin query language to walk around our model and answer questions about our data.
The main objective of this chapter is to illustrate a real-world graph schema that walks through multiple neighborhoods of graph data.
For your reference, we will use the words walk, navigate, and traverse interchangeably throughout this book to mean that we are writing graph queries.
Everything in this chapter up until now was required to set up answering the following three questions in this section:
What are the most recent 20 transactions involving Michael's account?
In December, at which vendors did Michael shop, and with what frequency?
Find and update the transactions that Jamie and Aaliyah most value: their payments from their account to their mortgage loan.
Let's walk through the queries and their results. Then, in the chapter's final section on advanced Gremlin, we will delve a bit deeper into how to shape the result payload.
Our recommendation is that you find a way to reference Figure 4-10 as you practice the queries in the upcoming sections. We recommend doing this because your schema functions as your map; you need to know where you are so that you can walk in the right direction to your destination.
Let's start with some pseudocode in Example 4-4 to think about how we are going to walk through our data to answer this first question.
Question: What are the most recent 20 transactions involving Michael's account?
Process:  Start at Michael's customer vertex
          Walk to his account
          Walk to all transactions
          Sort them by time, descending
          Return the top 20 transaction ids
We used the process outlined in Example 4-4 to create the Gremlin query in Example 4-5.
1dev.V().has("Customer","customer_id","customer_0").// the customer2out("owns").// walk to his account3in("withdraw_from","deposit_to").// walk to all transactions4order().// sort the vertices5by("timestamp",desc).// by their timestamp, descending6limit(20).// filter to only the 20 most recent7values("transaction_id")// return the transaction_ids
1dev.V().has("Customer","customer_id","customer_0").// the customer2out("owns").// walk to his account3in("withdraw_from","deposit_to").// walk to all transactions4order().// sort the vertices5by("timestamp",desc).// by their timestamp, descending6limit(20).// filter to only the 20 most recent7values("transaction_id")// return the transaction_ids
A sample of the results:
"184", "244", "268", ...
Let's dig into this query one step at a time.
On line 1, dev.V().has("Customer", "customer_id", "customer_0") looks up a vertex according to its unique identifier. Then on line 2, the step out("owns") walks through the outgoing owns edge to the Account vertices for this customer. In this case, Michael has only one account.
At this point, we want to access all transactions. On line 3, the in("withdraw_from", "deposit_to") step does just that: we walk through the incoming edge labels to access transactions. At line 4, we are on the transaction vertices.
We left a detail out of “An evolution of modeling transactions in a graph” that we want to bring up now. The simplicity of line 3 in Example 4-5 was also part of the motivation that led to how we designed the edges in our data model. This first query was much harder to write and reason about when the edges were going in different directions.
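To see why, consider a hypothetical variant of the model in which deposits pointed out of accounts while withdrawals pointed into them; reaching all transactions would then require a union of two walks instead of the single step on line 3:

// hypothetical mixed-direction model, for illustration only
dev.V().has("Customer","customer_id","customer_0").
   out("owns").                   // walk to the account
   union(out("deposit_to"),      // deposits modeled as outgoing edges
         in("withdraw_from"))    // withdrawals modeled as incoming edges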
The order() step on line 4 indicates that we need to provide some sort of order to the vertices, which are transactions. We specify the sort order on line 5 with the by("timestamp", desc) step. This means that we are going to access, merge, and sort all Transaction vertices according to their timestamp. Then we want to select only the 20 most recent vertices with limit(20). Last, on line 7, we want to get access to the transaction_ids, so we select them via the values("transaction_id") step.
This query will return a list of values that contains the transaction_id for each of the 20 most recent transactions across all of the customer's accounts.
Imagine how much more powerful this would be to display for the end user. They would be able to see the details that are most relevant to them instead of navigating multiple screens to join this data together in their head. This type of query is vital in understanding how to personalize your application to what a customer most cares about.
For this second question, let's start with an outline of the query in Example 4-6 to think about how we are going to walk through our data to answer the question.
Question: In December 2020, at which vendors did Michael shop, and with what frequency?
Process:  Start at Michael's customer vertex
          Walk to his credit card
          Walk to all transactions
          Only consider transactions in December 2020
          Walk to the vendors for those transactions
          Group and count them by their name
We start the process outlined in Example 4-6 in Example 4-7 and complete it in Example 4-8. In preparation for this query, we used the ISO 8601 timestamp standardization in our data to make it easier to range on dates. In the ISO 8601 standard, timestamps are commonly formatted as YYYY-MM-DD'T'hh:mm:ss'Z', where 2020-12-01T00:00:00Z represents the very beginning of December in 2020.
1dev.V().has("Customer","customer_id","customer_0").// the customer2out("uses").// Walk to his credit card3in("charge").// Walk to all transactions4has("timestamp",// Only consider transactions5between("2020-12-01T00:00:00Z",// in December 20206"2021-01-01T00:00:00Z")).7out("pay").// Walk to the vendors8groupCount().// group and count them9by("vendor_name")// by their name
1dev.V().has("Customer","customer_id","customer_0").// the customer2out("uses").// Walk to his credit card3in("charge").// Walk to all transactions4has("timestamp",// Only consider transactions5between("2020-12-01T00:00:00Z",// in December 20206"2021-01-01T00:00:00Z")).7out("pay").// Walk to the vendors8groupCount().// group and count them9by("vendor_name")// by their name
The results are:
{"Nike": "2", "Amazon": "1", "Target": "3"}
Randomization affects the results of query 2. If you use the data generation process instead of loading the data, your graph may have a slightly different structure and therefore different counts for query 2.
The setup for Example 4-7 follows a similar access pattern as before, where we start at a customer and then traverse to a neighboring vertex. We start at customer_0 and walk to their credit cards and then to transactions. On lines 4 through 6, we filter the data during the traversal. Here, we are filtering all vertices according to their timestamps in a specific range. Specifically, has("timestamp", between("2020-12-01T00:00:00Z", "2021-01-01T00:00:00Z")) keeps only the transactions whose timestamp falls within December 2020.
At line 7, following our schema, we walk to the vendors with the out("pay") step. Finally, we want to return the vendor's name along with how many times a transaction was observed with that vendor. We do this on lines 8 and 9 with groupCount().by("vendor_name").
In addition to between, Table 4-1 lists the most popular predicates you can use to range on values. Please refer to the book by Kelvin Lawrence for the full table of predicates.2
| Predicate | Usage |
|---|---|
| eq | Equal to |
| neq | Not equal to |
| gt | Greater than |
| gte | Greater than or equal to |
| lt | Less than |
| lte | Less than or equal to |
| between | Between two values, excluding the upper bound |
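Any of these predicates can be dropped into a has() step. The following hypothetical filters, which reuse the timestamp format from this chapter, sketch the pattern:

has("timestamp", gte("2020-12-01T00:00:00Z"))  // on or after December 1, 2020
has("timestamp", lt("2021-01-01T00:00:00Z"))   // strictly before January 1, 2021
has("timestamp", neq("2020-12-25T00:00:00Z"))  // any value except this exact instant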
You may be wondering: what if we wanted to order the output of Example 4-7?
If you wanted to return the results in decreasing order, you would do that by adding in the order().by() pattern, shown on lines 10 and 11 in Example 4-8.
1dev.V().has("Customer","customer_id","customer_0").2out("uses").3in("charge").4has("timestamp",5between("2020-12-01T00:00:00Z",6"2021-01-01T00:00:00Z")).7out("pay").8groupCount().9by("vendor_name").10order(local).// Order the map object11by(values,desc)// according to the groupCount map's values
1dev.V().has("Customer","customer_id","customer_0").2out("uses").3in("charge").4has("timestamp",5between("2020-12-01T00:00:00Z",6"2021-01-01T00:00:00Z")).7out("pay").8groupCount().9by("vendor_name").10order(local).// Order the map object11by(values,desc)// according to the groupCount map's values
The results are now:
{"Target": "3", "Nike": "2", "Amazon": "1"}
We threw in the use of scope in a traversal at line 10 with the step order(local).
Scope determines whether the particular operation is to be performed to the current object (local) at that step or to the entire stream of objects up to that step (global).
For a visual explanation of scope in a traversal, consider Figure 4-11.
To explain it simply, at the end of line 9, we needed to order the object in the pipeline, which is a map. The use of local on line 10 tells the traversal to sort and order the items within the map object. Another way to think about this is that we want to order the entries within the map. We do that by indicating that the scope is local to the object itself.
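A minimal way to see the difference is to compare the two scopes over the same stream; this sketch assumes a handful of Transaction vertices and exists only for illustration:

// global scope: orders the stream of traversers itself
dev.V().hasLabel("Transaction").
   values("transaction_id").
   order()

// local scope: rolls the stream into one list, then sorts inside that list
dev.V().hasLabel("Transaction").
   values("transaction_id").
   fold().
   order(local)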
The best way to understand traversal scope is to play with different queries in your Studio Notebook and see how the scope affects the shape of your results. More great visual diagrams on understanding the flow of data and object types are available on the DataStax Graph documentation pages.
If you ever question what object type you have in the middle of developing a Gremlin traversal, add .next().getClass() to where you are in your traversal development. This will inspect the objects at this point in your traversal and give you their class.
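For example, to check what kind of object the groupCount() step in Example 4-8 emits, you might run a sketch like this in your Studio Notebook:

dev.V().has("Customer","customer_id","customer_0").
   out("uses").
   in("charge").
   groupCount().
     by("vendor_name").
   next().getClass()  // inspects the current pipeline object; here, a map type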
The advantage of using a graph database really starts to show as we walk through multiple neighborhoods of data, as we will be doing with this third and last query. Here, we are accessing and mutating data across five neighborhoods of data in our graph. We are going to break this query down into three steps: access, mutation, and then validation.
The first simplification we are going to make is to reduce the scope of the query. We know that Jamie and Aaliyah share only one account: acct_0. Therefore, to further simplify our query, we can focus on walking from only one person; we choose Aaliyah.
This brings us to the first, shorter query we want to build:
Before we can update important transactions, we need to find them. The transactions we are looking for are those that indicate a loan payment from Aaliyah's joint account to Jamie and Aaliyah's mortgage. Let's outline our approach in pseudocode in Example 4-9 to think about how we are going to walk through our data to answer the question.
Question: Find Aaliyah's transactions that are loan payments
Process:  Start at Aaliyah's customer vertex
          Walk to her account
          Walk to transactions that are withdrawals from the account
          Go to the loan vertices
          Group and count the loan vertices
We used the process outlined in Example 4-9 to create the Gremlin query in Example 4-10.
1dev.V().has("Customer","customer_id","customer_4").// accessing Aaliyah's vertex2out("owns").// walking to the account3in("withdraw_from").// only consider withdraws4out("pay").// walking out to loans or vendors5hasLabel("Loan").// limiting to only loan vertices6groupCount().// groupCount the loan vertices7by("loan_id")// by their loan_id
1dev.V().has("Customer","customer_id","customer_4").// accessing Aaliyah's vertex2out("owns").// walking to the account3in("withdraw_from").// only consider withdraws4out("pay").// walking out to loans or vendors5hasLabel("Loan").// limiting to only loan vertices6groupCount().// groupCount the loan vertices7by("loan_id")// by their loan_id
The results for the sample data will look like:
{"loan80": "24", "loan18": "24"}
Let's step through Example 4-10. On line 1, we start by accessing Aaliyah's customer vertex. On line 2, we traverse to her account. Recalling the schema, we walk through the incoming withdraw_from edge on line 3 to access the withdrawals from that account.
On line 4, we walk through the pay edge label to arrive at either Loan or Vendor vertices. The hasLabel("Loan") step on line 5 is a filter that eliminates all vertices at this point that are not loans. This means we are now considering only the assets into which a payment has been made from the account and that are loans. On line 6, we group and count those loan vertices according to their unique identifier, as indicated on line 7.
The result payload indicates that this account has made 24 payments into each loan within the system.
Next, we want to go a step further and update the data in this traversal to indicate which transactions are mortgage payments.
The traversal required to accomplish this query is a mutating traversal. All we mean by mutating traversal is that it updates data in the graph as a part of the traversal. Example 4-11 shows how we can use the traversal above to write properties on the transactions that go from the account into loan_18, because loan_18 is Jamie and Aaliyah's mortgage loan.
1dev.V().has("Customer","customer_id","customer_4").// accessing Aaliyah's vertex2out("owns").// walking to the account3in("withdraw_from").// only consider withdraws4filter(5out("pay").// walking to loans or vendors6has("Loan","loan_id","loan_18")).// only keep loan_187property("transaction_type",// mutating step: set the "transaction_type"8"mortgage_payment").// to "mortgage_payment"9values("transaction_id","transaction_type")// return transaction & type
1dev.V().has("Customer","customer_id","customer_4").// accessing Aaliyah's vertex2out("owns").// walking to the account3in("withdraw_from").// only consider withdraws4filter(5out("pay").// walking to loans or vendors6has("Loan","loan_id","loan_18")).// only keep loan_187property("transaction_type",// mutating step: set the "transaction_type"8"mortgage_payment").// to "mortgage_payment"9values("transaction_id","transaction_type")// return transaction & type
The results are:
"144", "mortgage_payment", "153", "mortgage_payment", "132", "mortgage_payment", ...
Example 4-11 starts the same as the first part of our query. The new portion of this traversal spans lines 4 through 6 with the filter(out("pay").has("Loan", "loan_id", "loan_18")) steps. Here, we allow only the transactions that are connected to the loan_18 vertex to continue down the pipeline. This is because loan_18 is Jamie and Aaliyah's mortgage loan. On line 7, we mutate the transaction vertices by changing transaction_type to "mortgage_payment". At the end of this traversal, on line 9, we want to return the transaction_id along with its new property, its transaction_type.
At this point, it is very helpful to make sure that we did not update all of Aaliyah's transactions with mortgage_payment. We can do that with a quick check, shown in Example 4-12.
// check that we didn't update every transaction
1 dev.V().has("Customer","customer_id","customer_4").  // at the customer vertex
2    out("owns").                                      // at the account vertex
3    in("withdraw_from").                              // at all withdrawals
4    groupCount().                                     // group and count the vertices
5      by("transaction_type")                          // according to their transaction_type
The results from the Studio Notebook are shown below. We set unknown as the default value during the data loading process, which is also shown in the Studio Notebook:
{"mortgage_payment": "24", "unknown": "47"}
This query does a quick check to validate that we properly mutated our data. Combining lines 1 through 3, we process all of the transactions from Aaliyah's bank account. At line 4, we do a groupCount() for all of those vertices according to the value stored in the transaction_type property. Here, we see that we correctly updated only the 24 transactions that are mortgage payments to loan_18. This validates that our mutation query properly updated our graph structure.
This section started out with three questions, and the last three examples answered them using the Gremlin query language.
We stepped through the basic queries to show you where to start. Get your basic graph walks ironed out before you start exploring the full flexibility and expressivity of the Gremlin query language. We always recommend iterating through Gremlin steps in development mode to find the basic walks that accomplish your queries. This means we are asking you to execute line 1 of a Gremlin query and look at the results. Then execute lines 1 and 2 and look at the results, and so on.
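In practice, that iteration might look like the following sketch, which builds up Example 4-5 one executable step at a time:

dev.V().has("Customer","customer_id","customer_0")  // run: confirm you found one Customer vertex
dev.V().has("Customer","customer_id","customer_0").
   out("owns")                                      // run: confirm you reached the Account vertex
dev.V().has("Customer","customer_id","customer_0").
   out("owns").
   in("withdraw_from","deposit_to")                 // run: confirm the transactions come back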
After you have mapped out your basic walks, you can try out more advanced Gremlin. At this point in development, it is very common to find ways to create specific payload structures to pass back to your endpoint.
We will cover the most popular strategies for building JSON with Gremlin in the next section.
The goal of this section is to build up a more advanced version of our Gremlin query that answers a new question:
Is there anyone else who shares accounts, loans, or credit cards with Michael?
We would like to introduce a new question to demonstrate advanced Gremlin concepts within a small neighborhood of data. Once you understand how these concepts apply to this question, we invite you to use the accompanying notebook for this chapter to implement the concepts for the other queries introduced in “Basic Gremlin Navigation”.
We will work through shaping the results of our new query in a few stages. They are:
Shaping query results with the project(), fold(), and unfold() steps
Removing data from the results with the where(neq()) pattern
Planning for robust result payloads with the coalesce() step
For anyone diving deeper into the world of Gremlin queries, we highly recommend the detail and explanations in the book Practical Gremlin: An Apache TinkerPop Tutorial by Kelvin Lawrence.3
When we start writing a new query, we like to slowly build up its required pieces. One of the most useful Gremlin steps is the project() step, because it helps us build up a specific map of data from our query. Let's start building our query out by defining the three keys we want to have in our map: CreditCardUsers, AccountOwners, and LoanOwners.
1dev.V().has("Customer","customer_id","customer_0").2project("CreditCardUsers","AccountOwners","LoanOwners").3by(constant("name or no owner for credit cards")).4by(constant("name or no owner for accounts")).5by(constant("name or no owner for loans"))
1dev.V().has("Customer","customer_id","customer_0").2project("CreditCardUsers","AccountOwners","LoanOwners").3by(constant("name or no owner for credit cards")).4by(constant("name or no owner for accounts")).5by(constant("name or no owner for loans"))
This query structure is the base of what we are building toward. We want to start with a specific person in this example, namely Michael. Then we want to create a data structure that will have three keys: CreditCardUsers, AccountOwners, and LoanOwners. We create this map with the project() step on line 2. The arguments to the project() step are the three keys. For each key in the project() step, we want to have a by() step. Each by() modulator creates the value associated with its key:
The by() modulator on line 3 will create a value for the CreditCardUsers key.
The by() modulator on line 4 will create a value for the AccountOwners key.
The by() modulator on line 5 will create a value for the LoanOwners key.
Let's take a look at the results at this point:
{"CreditCardUsers": "name or no owner for credit cards", "AccountOwners": "name or no owner for accounts", "LoanOwners": "name or no owner for loans"}
This is a good baseline to work from. Next, let's walk through our graph structure to start to populate the values in our map. We will start with the data for the first key: finding people who share a credit card with Michael.
Thinking back to our schema, we will need to walk through the uses edge to get to the credit cards. Then we will walk back through the uses edge to get back to people. After that, we want to access their names. In Gremlin, we would add this walk on lines 3, 4, and 5:
1dev.V().has("Customer","customer_id","customer_0").2project("CreditCardUsers","AccountOwners","LoanOwners").3by(out("uses").4in("uses").5values("name")).6by(constant("name or no owner for accounts")).7by(constant("name or no owner for loans"))
1dev.V().has("Customer","customer_id","customer_0").2project("CreditCardUsers","AccountOwners","LoanOwners").3by(out("uses").4in("uses").5values("name")).6by(constant("name or no owner for accounts")).7by(constant("name or no owner for loans"))
1dev.V().has("Customer","customer_id","customer_0").2project("CreditCardUsers","AccountOwners","LoanOwners").3by(out("uses").4in("uses").5values("name")).6by(constant("name or no owner for accounts")).7by(constant("name or no owner for loans"))
1dev.V().has("Customer","customer_id","customer_0").2project("CreditCardUsers","AccountOwners","LoanOwners").3by(out("uses").4in("uses").5values("name")).6by(constant("name or no owner for accounts")).7by(constant("name or no owner for loans"))
The only steps we added were to walk from Michael out to his credit card via the uses edge on line 3. Then, on line 4, we walk back to all people who use that credit card. The resulting payload is:
{"CreditCardUsers": "Michael", "AccountOwners": "name or no owner for accounts", "LoanOwners": "name or no owner for loans"}
This confirms what we know: Michael didn't share any credit cards with other people. We expected to see his name in the result set.
Now let's do the same thing for the next key in our map: AccountOwners. Here, we want to walk out the owns edge to the account vertex and back to the person vertex:
1 dev.V().has("Customer","customer_id","customer_0").
2    project("CreditCardUsers","AccountOwners","LoanOwners").
3      by(out("uses").
4         in("uses").
5         values("name")).
6      by(out("owns").
7         in("owns").
8         values("name")).
9      by(constant("name or no owner for loans"))
Let's look at the resulting payload:
{"CreditCardUsers": "Michael", "AccountOwners": "Michael", "LoanOwners": "name or no owner for loans"}
Looking at this data, we do not see what we would expect. We expected to see Maria as a resulting value for AccountOwners. Maria does not show up because Gremlin is lazy; it returns the first result, not all results. We need to add a barrier to force all results to finish and return.
The barrier that we like to use here is fold(). The fold() step will wait for all of the data to be found and then roll up the results into a list. This is a bonus, because now we can build up specific data type rules for our application. The adjusted query reads:
1  dev.V().has("Customer","customer_id","customer_0").
2     project("CreditCardUsers","AccountOwners","LoanOwners").
3       by(out("uses").
4          in("uses").
5          values("name").
6          fold()).
7       by(out("owns").
8          in("owns").
9          values("name").
10         fold()).
11      by(constant("name or no owner for loans"))
The shape of the data in the resulting payload is what we were expecting to see:
{"CreditCardUsers": ["Michael"], "AccountOwners": ["Michael", "Maria"], "LoanOwners": "name or no owner for loans"}
Let's complete the construction of our map by adding in the statements in the last by() step. These statements need to walk from Michael out to his loan and then back. The query and result set are:
1  dev.V().has("Customer","customer_id","customer_0").
2     project("CreditCardUsers","AccountOwners","LoanOwners").
3       by(out("uses").
4          in("uses").
5          values("name").
6          fold()).
7       by(out("owns").
8          in("owns").
9          values("name").
10         fold()).
11      by(out("owes").
12         in("owes").
13         values("name").
14         fold())

{"CreditCardUsers": ["Michael"], "AccountOwners": ["Michael", "Maria"], "LoanOwners": ["Michael"]}
At this point, we have the expected results. We see that Michael shares an account with Maria. And we see that Michael doesn't share credit cards or loans with anyone else.
For some applications, it isn't helpful to return that Michael shares a credit card with himself. Let's dive into how we would remove Michael from this resulting payload.
It might be useful for you to eliminate Michael from the result set. We can do that by using the as() step to store Michael's vertex and then filtering that vertex out of the results. You can remove a vertex from your pipeline with the step where(neq("some_stored_value")).
The next version of our query, in which we have applied this step directly to each section, is shown in Example 4-13.
1  dev.V().has("Customer","customer_id","customer_0").as("michael").
2     project("CreditCardUsers","AccountOwners","LoanOwners").
3       by(out("uses").
4          in("uses").
5          where(neq("michael")).
6          values("name").
7          fold()).
8       by(out("owns").
9          in("owns").
10         where(neq("michael")).
11         values("name").
12         fold()).
13      by(out("owes").
14         in("owes").
15         where(neq("michael")).
16         values("name").
17         fold())
The full results of Example 4-13 are shown below:
{"CreditCardUsers": [], "AccountOwners": ["Maria"], "LoanOwners": []}
The main additions to our query occur on lines 1, 5, 10, and 15 in the above query. On line 1, we store the vertex for Michael with the as("michael") step. Let's take a look at what is happening with where(neq("michael")) on line 5, which is the same thing that is happening on lines 10 and 15.
To understand what is happening on line 5, you need to remember where you are in your graph. At the end of line 4, we are on Customer vertices. Specifically, we are processing customers that share a credit card with Michael. This is where the where(neq("michael")) step comes in. We want to apply a true/false filter to every vertex in the pipeline. The true/false filter test is whether or not that vertex is equal to Michael: where(neq("michael")). If the vertex is Michael, line 5 eliminates it from the traversal. If the vertex is not Michael, the vertex passes through the filter and remains in the pipeline.
Depending on your team's data structure rules, checking whether or not a value in your data payload is an empty list may not be preferred. We can design around that.
We can implement try/catch logic so that your query doesn't return an empty list. We will step through this for the first key in the map: CreditCardUsers. After we step through that, we will add in the full query details for the two remaining by() steps.
Let's rewind and go back to just building up the JSON payload for the value associated with CreditCardUsers. We are starting from here:
1 dev.V().has("Customer","customer_id","customer_0").as("michael").
2    project("CreditCardUsers","AccountOwners","LoanOwners").
3      by(out("uses").
4         in("uses").
5         where(neq("michael")).
6         values("name").
7         fold()).
8      by(constant("name or no owner for accounts")).
9      by(constant("name or no owner for loans"))

{"CreditCardUsers": [], "AccountOwners": "name or no owner for accounts", "LoanOwners": "name or no owner for loans"}
You can implement try/catch logic in Gremlin with the coalesce() step. We want to shape the results so that there is always a value in the lists for each key, like "CreditCardUsers": ["NoOtherUsers"]. Let's start by seeing how to integrate the coalesce() step into our query:
1  dev.V().has("Customer","customer_id","customer_0").as("michael").
2     project("CreditCardUsers","AccountOwners","LoanOwners").
3       by(out("uses").
4          in("uses").
5          where(neq("michael")).
6          values("name").
7          fold().
8          coalesce(constant("tryBlockLogic"),      // try block
9                   constant("catchBlockLogic"))).  // catch block
10      by(constant("name or no owner for accounts")).
11      by(constant("name or no owner for loans"))
The resulting payload is:
{"CreditCardUsers": "tryBlockLogic", "AccountOwners": "name or no owner for accounts", "LoanOwners": "name or no owner for loans"}
When you use the coalesce() step on line 8, it takes two arguments. The first argument is on line 8 and can be thought of as the try block logic. The second argument is on line 9 and can be thought of as the catch block logic.
If the try block logic succeeds, then the resulting data is passed down the pipeline. In this case, for illustrative purposes, we used something that would definitely succeed: the constant() step. This step returned the string "tryBlockLogic" that we see in the resulting payload. The constant() step is useful for many reasons, one of which is that it can serve as a placeholder while you build up more complicated queries. This is how we are using it here.
Should the first argument of the coalesce() step fail on line 8, the second argument will execute on line 9. Let's look at how we can use this to populate what we want in our data payload:
1  dev.V().has("Customer","customer_id","customer_0").as("michael").
2     project("CreditCardUsers","AccountOwners","LoanOwners").
3       by(out("uses").
4          in("uses").
5          where(neq("michael")).
6          values("name").
7          fold().
8          coalesce(unfold(),                    // try block
9                   constant("NoOtherUsers"))).  // catch block
10      by(constant("name or no owner for accounts")).
11      by(constant("name or no owner for loans"))

{"CreditCardUsers": "NoOtherUsers", "AccountOwners": "name or no owner for accounts", "LoanOwners": "name or no owner for loans"}
On line 8, the logic that we added to the try block is the unfold() step. This tries to take the results from the previous step and successfully unfold them. The results at this point in the pipeline are an empty list, []. In Gremlin, you cannot unfold an empty object, so the try block produces nothing. Therefore, we execute line 9, the second argument of the coalesce() step: constant("NoOtherUsers"). This is why we see the entry "CreditCardUsers": "NoOtherUsers" in our result payload.
Regrettably, we lost our guaranteed list structure. We can add that back in with a fold() after the coalesce() step:
1  dev.V().has("Customer","customer_id","customer_0").as("michael").
2     project("CreditCardUsers","AccountOwners","LoanOwners").
3       by(out("uses").
4          in("uses").
5          where(neq("michael")).
6          values("name").
7          fold().
8          coalesce(unfold(),
9                   constant("NoOtherUsers")).fold()).
10      by(constant("name or no owner for accounts")).
11      by(constant("name or no owner for loans"))

{"CreditCardUsers": ["NoOtherUsers"], "AccountOwners": "name or no owner for accounts", "LoanOwners": "name or no owner for loans"}
The steps we added from line 5 to line 9 create a predictable data structure to exchange throughout your application. It will be well-formatted JSON about which other applications can reason.
Next, we need to add this try/catch logic to each by() step. The full logic pattern to add at the end of each by() step in our full query is:
coalesce(unfold(),                  // try to unfold the names
         constant("NoOtherUsers")). // inject this string if there are no names
fold()                              // structure the results into a list
This Gremlin pattern ensures we have a nonempty list in the resulting payload. The full query and its results are:
1  dev.V().has("Customer","customer_id","customer_0").as("michael").
2     project("CreditCardUsers","AccountOwners","LoanOwners").
3       by(out("uses").
4          in("uses").
5          where(neq("michael")).
6          values("name").
7          fold().
8          coalesce(unfold(),
9                   constant("NoOtherUsers")).fold()).
10      by(out("owns").
11         in("owns").
12         where(neq("michael")).
13         values("name").
14         fold().
15         coalesce(unfold(),
16                  constant("NoOtherUsers")).fold()).
17      by(out("owes").
18         in("owes").
19         where(neq("michael")).
20         values("name").
21         fold().
22         coalesce(unfold(),
23                  constant("NoOtherUsers")).fold())

{"CreditCardUsers": ["NoOtherUsers"], "AccountOwners": ["Maria"], "LoanOwners": ["NoOtherUsers"]}
We find that iterative building and stepping through Gremlin steps is the best way to wrap your head around the query language. This book is about teaching you our thought processes, and this is how we think through using Gremlin. There is more than one way to write a graph query; we hope you are curious about using other steps to process the same data. Figuring this out can be as easy as opening up a Studio Notebook and exploring new steps on your own.

Bringing back our scuba analogy from the beginning of this chapter, our time training in the pool has come to a close. As we see it, the progression through the technical examples in this chapter is just like learning buoyancy control or deepwater troubleshooting within a pool. At some point, you have learned everything you can from practicing in a controlled environment.

With the foundation we have built over the past few chapters, it is time to take the leap out of your development environment and build a production-ready graph database.

Before you get too concerned, this doesn’t mean you are supposed to know everything there is to know about graph data. There are still myriad topics we are continuing to explore ourselves.

What it does mean, however, is that we think you are ready to move into a deeper understanding of using graph data in distributed systems. We set up this example to get you ready for one last step down into the physical data layer of understanding graph data structures in Apache Cassandra. Specifically, the upcoming chapter will show you how to optimize your graph structures for distributed applications.

While illustrating how we think through graph data, we purposefully set up some traps in the example in this chapter. In the next chapter, we will show these traps to you and walk you through their resolution. This upcoming chapter will be the last chapter that uses our C360 example, as it will describe the final iteration in creating a production-quality graph schema for this example.

1 Ora Lassila and Ralph R. Swick, “Resource Description Framework (RDF) Model and Syntax Specification,” 1999. https://oreil.ly/zWcnO

2 Kelvin Lawrence, Practical Gremlin: An Apache TinkerPop Tutorial, January 6, 2020, https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html.

3 Kelvin Lawrence, Practical Gremlin: An Apache TinkerPop Tutorial, January 6, 2020, https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html.
When you use DataStax Graph, you are working with graph data in Cassandra. And if you have been following along and executing the implementation details from the last two chapters, you have already been using it.

The paradigm shift from working with a traditional database to working with Apache Cassandra is that we write our data according to how we are going to read it.

To illustrate how we apply this, the examples in Chapters 3 and 4 used but skipped over fundamental topics of working with graph data in Apache Cassandra. Concepts like edge direction and partition key design are fundamental to building a production-quality, scalable, and distributed graph data model.

We are going to dig deeply into the topics of distributed data to set you up for a successful use of distributed graph technology within your production stack.

Recall that we mentioned at the end of Chapter 4 that we purposely set up some traps. Our example built up the schema shown in Figure 5-1 and aimed to use queries like the one in Example 5-1.

We need to connect two concepts together so you can see the whole picture. First, all of our queries have used the development traversal source dev.V(). The development traversal source in DataStax Graph enables you to walk around your data without worrying about indexing strategies. Second, our queries walk from an account vertex to transactions. The query in Example 5-1 uses the production traversal source g.V(). If you try to run the query in Example 5-1 in DataStax Studio, you will see an execution error like the one that follows.
g.V().has("Customer","customer_id","customer_0").// the customerout("owns").// walk to their account(s)in("withdraw_from","deposit_to")// access all transactions
g.V().has("Customer","customer_id","customer_0").// the customerout("owns").// walk to their account(s)in("withdraw_from","deposit_to")// access all transactions
Execution error:

com.datastax.bdp.graphv2.engine.UnsupportedTraversalException:
One or more indexes are required to execute the traversal
This error is tied to the representation of graph data structures on disk. In the rest of this chapter, we take a peek under the hood to explain the why and then apply the how.

The primary intent of this chapter is to introduce design and operational recommendations for modeling data efficiently prior to entering production. For that, this chapter builds on the example from Chapter 4 by detailing how graph data structures operate in Apache Cassandra.

At the end of this chapter, you will have a list of 10 data modeling recommendations to apply to any new problem. We will use these same tips throughout the remaining examples in this book, too.

We selected the next set of technical topics to illustrate the minimum required set of concepts for building production-quality distributed graph applications. This chapter has three main sections that align with the accompanying notebook and technical materials.

The first section of this chapter revisits the topics we used but did not explain in Chapter 4. Here, we introduce the fundamentals of distributed graph structures to model our queries from that chapter. Namely, you will learn about partition keys, clustering columns, and materialized views.

The second section applies the concepts of distributed graph structures to our second set of data modeling recommendations. We will introduce Cassandra topics such as denormalization, revisit edge direction, and talk about loading strategies. These tips represent data modeling decisions that we recommend for production-quality, distributed graph schema.

The last section walks through the final iteration of our C360 example. We will explain the schema code that applies the concepts of materialized views and indexing strategies. And we will go through one last iteration of our Gremlin queries to use the new optimizations.

Altogether, the thought process and development in Chapters 3, 4, and 5 represent the development life cycle of designing, exploring, and finalizing the models and queries for your first application with distributed graph data.

Let’s get started by taking a final step down into the physical data layer of working with graph data in Apache Cassandra.

This section looks at the fundamental concepts of working with graph data structures in Apache Cassandra: primary keys, partition keys, clustering columns, and materialized views.

We are going to discuss these Cassandra data modeling topics from a graph user’s perspective.

First, we will talk about what you need to know about vertices, and then we will go over what you need to know about edges. For vertices, you need to know about primary keys and partition keys. For edges, you need to know about clustering columns and materialized views.

Let’s get started with the concept that connects everything: the primary key.

A major challenge of building a good data model within a distributed system is determining how to uniquely identify your data with primary keys.

You have already worked with one of the simplest forms of a primary key: the partition key.
The partition key is the first element of a primary key in Apache Cassandra. The partition key is the part of the primary key that identifies the location of the data in a distributed environment.
From a user’s perspective, the entire primary key is required for you to access your data from the system. The partition key is just the first piece of the primary key.

The primary key describes a unique piece of data in the system. In DataStax Graph, a primary key can be made up of one or more properties.

You have already been using and working with primary and partition keys. In DataStax Graph, you specify the desired primary key in the schema API. We saw the simplest version of a primary key—just one partition key—in the previous chapter with:
schema.vertexLabel("Customer").ifNotExists().partitionBy("customer_id",Text).// basic primary key: one partition keyproperty("name",Text).create();
schema.vertexLabel("Customer").ifNotExists().partitionBy("customer_id",Text).// basic primary key: one partition keyproperty("name",Text).create();
The partitionBy() method indicates the value that will be included in the label’s partition key. In this case, we have only one value, customer_id. This means that customer_id is the full primary key and partition key for the Customer vertex.

From a developer’s perspective, this decision has three consequences for your application. First, the value for customer_id uniquely identifies the data. Second, your application will need the value for customer_id to read the data about the customer. We will cover the third point in a moment.

These two consequences govern how you, the user, design your data’s primary and partition keys. Let’s take a look at an example. Previously, you used your primary key to look up this data in Gremlin via:
g.V().has("Customer","customer_id","customer_0").elementMap()
g.V().has("Customer","customer_id","customer_0").elementMap()
This returns:

{"id":"dseg:/Customer/customer_0","label":"Customer","name":"Michael","customer_id":"customer_0"}
Looking up vertices or edges by their full primary key is the fastest way to read data in DataStax Graph. This is one of the main reasons that selecting a good partition and primary key for your data is so important.

There is a third consequence of the partition key in Apache Cassandra. A vertex label’s partition key assigns your graph data to a host within a distributed environment. Partition keys also give you different ways you can colocate your graph data. Let’s dig into the details.
We recommend this section if you like getting deep in the weeds.
This section aims to synthesize topics across the Cassandra and graph communities. We will explore some hypothetical alternatives to graph partitioning by examining different partition key choices to colocate graph data. We will conclude with the partition strategy we started with for our example, but you will have gained a better understanding of the effects of partition key design and the graph partitioning problem.

And we need to be pedantic about what we really mean for a brief moment.

The word partition means two very different things to two different groups of people. The Cassandra community’s understanding of partition answers the question, “Where is my data in my cluster?” The graph community’s understanding of the term answers the question, “How can I organize my graph data into a smaller group to minimize an invariant?”

This book applies the Cassandra community’s definition of partitioning to working with graph data. When we refer to a partition, we are referencing data locality, or on which server your data is written to disk across your distributed system.

To illustrate how we will be using the idea of partitioning, let’s recall some data for our current example, as shown in Figure 5-2.

To visualize data assignment to a server (also referred to as an instance or a node) in a cluster, imagine you are working with a cluster of four servers running DataStax Graph in Apache Cassandra. In Figure 5-3, we represent a distributed cluster with a circle that has four servers running Cassandra. (Each eye in Figure 5-3 represents DataStax Graph in Cassandra.) Then, we show where your graph data is written to disk by illustrating the graph data next to the server around your cluster, as we do in Figure 5-3.

The largest circle in Figure 5-3 represents a cluster of four servers, each indicated with the Cassandra eye logo, running DataStax Graph in Apache Cassandra. The sample data from Figure 5-2 is shown next to the server in which the data is physically stored. In Apache Cassandra, data is mapped to a specific server in your cluster according to its partition key.

In Figure 5-3, you see that the data for customer_0 is mapped to four different machines. The customer vertex is written to server 1, the loan vertex is on server 2, the account vertex is written to machine 3, and the credit card vertex is on machine 0.

You can think of partition keys and their association to data locality in a distributed environment as follows: data with the same partition key is stored on the same machine, and data with different keys may be stored on different machines.
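To make the mapping tangible, here is a toy Groovy sketch of the idea. It uses a naive modulo hash rather than Cassandra’s actual Murmur3 partitioner and token ring, so treat it as an illustration of the principle only; the key values are sample identifiers from this chapter:

// A toy illustration only: Cassandra actually hashes partition keys with the
// Murmur3 partitioner onto a token ring, not with a simple modulo.
def numServers = 4
def serverFor = { String key -> Math.abs(key.hashCode()) % numServers }

["customer_0", "acct_14", "loan_18"].each { key ->
    println "${key} -> server ${serverFor(key)}"
}
// The same key always maps to the same server; different keys may land on
// different servers.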
With graph data, there are strategies for designing your partition keys to minimize the latency of your graph traversals. Different partitioning strategies affect the colocation of your data and, therefore, the latency of your query.

To minimize jumping around machines in your cluster when you are processing your graph data, you may consider a partitioning strategy that keeps all the data related to your query within the same partition. To illustrate this idea, Figure 5-4 shows a partitioning strategy optimized for the expected access pattern of a C360 application. The partitions are defined according to the individual customer and their data because a C360 query will typically be looking for an individual and their associated data. For our sample data, we would create a partition for each individual.

If you have a background in graph theory, the partitioning strategy illustrated in Figure 5-4 is similar to partitioning according to connected components.

If you have a background in working with Apache Cassandra, the partitioning strategy illustrated in Figure 5-4 follows the same practice of partitioning according to access pattern.

To implement the partitioning strategy illustrated in Figure 5-4, you would need to add the customer’s unique identifier as the partition key for every vertex label. In your schema code, we implement the partitioning strategy with:
schema.vertexLabel("Account").ifNotExists().partitionBy("customer_id",Text).clusterBy("acct_id",Text).// to be defined in a coming sectionproperty("name",Text).create();
schema.vertexLabel("Account").ifNotExists().partitionBy("customer_id",Text).clusterBy("acct_id",Text).// to be defined in a coming sectionproperty("name",Text).create();
It is useful to consider this partitioning strategy because it minimizes the latency of your query. All the data for your customer-centric query is colocated on the same node in your environment. This is an optimization at the physical data layer that will be advantageous when you query your data.

However, there are two reasons why this type of partition strategy will not be recommended for the queries we are exploring for our example. Recall that the full primary key is required to start your graph query. The first downside to the design shown in Figure 5-4 is that you will need to know the customer’s identifier to start your graph traversal at an account.

Applying this to our example of a shared account, acct_14, brings us to another drawback to using this partition strategy. This schema design will create two vertices about acct_14 that are adjacent to two different people. This means that you won’t be able to start at acct_14 and find all customers who own that account. This has implications for your graph query.
For the C360 queries we are exploring in this example, the partition strategy from Figure 5-4 doesn’t make sense. When we talk about trees in an upcoming example, however, it makes sense to consider data model optimizations to minimize query latency.

Let’s look at a second strategy and compare it to colocating data according to your application’s access pattern.

Think back to the full schema for our example and recall that each vertex label had a single, unique partition key. You can think of this as separating your graph data via the most granular division possible: the data’s most unique value.

The graph data in Figure 5-5 would distribute the vertices across your cluster according to the partition key’s value. Essentially, each vertex will be mapped to a different partition because each partition key’s value is unique.

One of the drawbacks to partitioning your vertices according to a unique key is that any time you need to walk through your data, you will be jumping between machines across your distributed environment. The purpose of using graph data in your application is to use the connections and relationships in your data. If you structure your graph data across a distributed environment according to unique identifiers, that also means that you will (likely) be switching servers each time you need to access connected data.

There are benefits and drawbacks to the different strategies for partitioning graph data in a distributed environment. Partitioning your graph data according to access pattern creates limitations on how you can walk through connected data. On the other hand, this strategy minimizes traversal latency by colocating large components of data onto the same node.
The most common way to partition your graph data is by the data’s unique identifier. This makes it easiest to plan for query flexibility, but it also introduces latency into your queries due to the distributed nature of your graph. This is the approach we will use for our C360 example.
The only way to understand the implications of any partition strategy is to calculate what it would look like for your data and queries. This requires a balance between understanding the distributions of the data for the application you need to build today and considering the future scope toward which you are building.

Selecting a good partitioning strategy is more complicated when we are working with graph data. Partitioning graph data around a distributed environment is synonymous with breaking up your graph data into different sections. Optimizing which data belongs to a particular section is classified as one of the hardest types of problems in computer science: an NP-complete problem. While maybe not the best news, this helps to explain why using graph technologies in a distributed environment isn’t as simple as translating an entity-relationship diagram into a graph data model.
On the topic of partitioning, there are two main takeaways to restate here: uniqueness and locality. In DataStax Graph, your data’s primary key is its unique identifier. For the fastest performance, you start your graph queries by looking up data via its full primary key.
The second thing to note is that your data’s partition key determines its locality in your cluster. This governs which machines in your cluster will store the data and the colocation of other data alongside it.

Given uniqueness and locality with partition keys, let’s take a look at how edges are represented in Apache Cassandra.

Diving into the world of graph modeling brings a large wave of terms, concepts, and thinking patterns. Now that we have the basics covered, let’s take a look at how graph data, namely the edges, can be represented on disk or in memory.

There are three main data structures for representing edges in data:
An edge list is a list of pairs in which every pair contains two adjacent vertices. The first element in the pair is the source (from) vertex, and the second element is the destination (to) vertex.
An adjacency list is an object that stores keys and values. Each key is a vertex, and the value is a list of the vertices that are adjacent to the key.
An adjacency matrix represents the full graph as a table. There is a row and column for each vertex in the graph. An entry in the matrix indicates whether there is an edge between the vertices represented by the row and column.
To understand these data structures, let’s look at how we would map a small example of graph data into each structure.

There is a significant amount of detail illustrated in Figure 5-6. At the top, we show an example of five vertices that are connected by four edges. Direction matters when you map the data into each of the graph data structures below.

Let’s walk through each data structure.

On the lower left in Figure 5-6, we have written out how the example data would be stored in an edge list. The edge list contains four entries: one entry per edge in our example data. In the center, we represent how the example data maps to an adjacency list. The adjacency list has two keys: one key per vertex with outgoing edges. The value for each key is a list of the incoming vertices the edges point to. The last data structure, shown on the far right, is an adjacency matrix. There are five rows and five columns: one row or column for each vertex in the graph. Each entry in the matrix indicates whether there is an edge going from the row vertex to the column vertex.
There are space and time trade-offs for each data structure. Skipping over optimizations that can be made for each individual data structure, let’s consider the complexities of each at a basic level. Edge lists are the most compressed version of representing your graph, but you have to scan the entire data structure to process all edges about a specific vertex. Adjacency matrices are the fastest way to walk through your data, but they take up an inordinate amount of space. Adjacency lists combine the benefits of the other two models by providing an indexed way to access a vertex and limit the list scans to only the individual vertex’s outgoing edges.

In DataStax Graph, we use Apache Cassandra as a distributed adjacency list to store and traverse your graph data. Let’s dig into how we optimize the storage of edges on disk so you can get the most benefit out of the sorting order of your edges during your graph traversals.
You used the concept of clustering columns when you added edge labels to your graph in Chapter 4.

A clustering column determines a sorting order of your data in tables on disk.

Clustering columns make up the final components of a table’s primary key in Cassandra. Clustering columns inform the database how to store the rows in a sorted order on disk, which makes data retrieval more efficient.

We want to dig into the details of clustering columns because they explain two concepts at the same time. First, the technical implications of clustering columns detail exactly why the query at the beginning of this chapter returned an error. Second, clustering columns illustrate how we sort your edges on disk in an adjacency list structure to provide the fastest access possible.

Example 5-2 illustrates the use of a clustering column in creating an edge label.
schema.edgeLabel("owns").ifNotExists().from("Customer").// the edge label's partition keyto("Account").// the edge label's clustering columnproperty("role",Text).create()
schema.edgeLabel("owns").ifNotExists().from("Customer").// the edge label's partition keyto("Account").// the edge label's clustering columnproperty("role",Text).create()
Following the example in Example 5-2, we can pick out the partition key and clustering columns for our edge label:

The from("Customer") step means that the full primary key of the Customer vertex will be the partition key for the edge label owns: (customer_id).
The full primary key for Account will be the clustering column for the owns edge: (acct_id).
Putting this together, we can lay out Cassandra’s table structures alongside the graph schema, as seen in Figure 5-7.
Figure 5-7 shows the table structures in Cassandra as they map to graph schema using the Graph Schema Language (GSL). The Customer vertex creates a table with a partition key, customer_id. The owns edge connects Customer to Account. The partition key of the owns edge is the customer_id. The owns edge also has a clustering key, in_acct_id, which is the partition key of the account vertex. There is a third column in the customer_owns_account table: role. This is a simple property and is not a part of the primary key. As a result, the value for role will come from the most recent write of an edge between a customer and an account.

To make this concrete, Figure 5-8 shows an example of data that follows the schema from Figure 5-7.
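In the same spirit, here is a hedged Groovy sketch of the rows the customer_owns_account table might hold; the identifiers and role values are illustrative, not the chapter’s exact data:

// Layout: (partition key) customer_id | (clustering column) in_acct_id | role
def customerOwnsAccount = [
    [customer_id: "customer_0", in_acct_id: "acct_0",  role: "primary"],   // illustrative row
    [customer_id: "customer_0", in_acct_id: "acct_14", role: "joint"]      // illustrative row
]
// All of one customer's owns edges live in that customer's partition,
// sorted on disk by the clustering column in_acct_id.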
Before we move on to a different topic, there is one last idea to synthesize about clustering keys and edges in DataStax Graph. In Chapter 2, we outlined cases in which you want to have many edges between two vertices. We denote this in the GSL with a double-lined edge. In Cassandra, we would make that property a clustering key. Figure 5-9 shows the Cassandra schema alongside a graph schema that models adjacent vertices as a collection.
Figure 5-10 shows the table structures in Cassandra as they map to graph schema using the GSL when we model the multiplicity of a graph with many edges between instance vertices. The difference is in the table for the owns edge. We now have the role as a clustering key for this edge, placed before the clustering key for the acct_id. The schema from Figure 5-9 allows there to be multiple edges between vertices, as we show in Figure 5-10.
Now that we understand the structure of edges on disk, let’s visit where they will be stored within a distributed environment.

Recall that the partition key identifies where the data will be written within the cluster. This means that the outgoing edges for a vertex will be stored on the same machine as the vertex itself. We previewed this in Figure 5-5 because the edges have the same color as the customer vertices; they are all orange. To illustrate what we mean, let’s look at the locality of edges in our cluster, as shown in Figure 5-11.

The image in Figure 5-11 illustrates where the edges for customer_0 will be stored within a distributed environment. Each of the edges will be colocated on the same machine as the vertex for customer_0 because each edge has the same partition key: the customer_id.

The next thing to understand is how the edges are sorted within their partition. The full primary key of the adjacent vertex label becomes the clustering column(s) of the edge label. This means that the edges are sorted on disk according to their incoming vertex’s primary key, as visualized in Figure 5-12.

The main concept to understand from Figure 5-12 is illustrated on the right. We are showing that the vertex for customer_4, Aaliyah, is written to disk on machine 1 in our cluster. Also on machine 1, we will find the outgoing edges from Aaliyah sorted according to their incoming vertex. Aaliyah has two loans connected to her with an owes edge. We see that on disk, these edges will be sorted according to the incoming vertex’s partition key, the loan_id. We see loan_18 is the first entry and loan_80 is the second entry.
To check whether you are synthesizing concepts: where would the customer vertices for Michael, Maria, Rashika, and Jamie be in Figure 5-12?
Answer: The partition key for each of those vertices is their customer_id, which would be hashed and mapped to any one of the servers. Because we are working with five customers in total, there will be at least one server with two customer vertices. This logic is referred to as the “pigeonhole principle” in mathematics.
You might be asking yourself: why are we getting into all this? It all comes down to the minimum requirement for accessing a piece of data in Apache Cassandra: the partition key.

The main area in which you are going to feel the effects of an edge’s primary key design comes into how you access your edges. To use an edge, you have to know its partition key.

Because of this, we cannot yet traverse our edges in the reverse direction! This is because there are no edges in the system that start with the partition key from the incoming vertex labels in our examples.

Remember our query from Example 5-1?
g.V().has("Customer","customer_id","customer_0").// the customerout("owns").// walk to their account(s)in("withdraw_from","deposit_to")// walk to all transactions
g.V().has("Customer","customer_id","customer_0").// the customerout("owns").// walk to their account(s)in("withdraw_from","deposit_to")// walk to all transactions
Recalling the schema we built in the previous chapter, the deposit_to edges point from a Transaction to an Account. However, this query is trying to walk that edge in the reverse direction: from the Account to the Transaction.

Applying what we just learned about edges in DataStax Graph, we know that this error happens because the edge does not exist on disk. The edge was written from the transaction to the account, but not the reverse.
If we want to walk from accounts to transactions, then we need to store the edge in the other direction as well. This is not done by default in DataStax Graph because of performance implications, similar to how indexing every column in a relational data model is an antipattern.

What we need are bidirectional edges, or edges that go in both directions. This option brings us to the last technical topic in this chapter.

One of the main reasons engineers love Apache Cassandra is they are willing to trade data duplication for faster data access. This is where materialized views come into play with DataStax Graph. From the user’s perspective, you can think of a materialized view as follows:

A materialized view creates and maintains a copy of the data in a separate table with a different primary key structure, rather than requiring your application to manually write the same data multiple times to create the access patterns you need.

Under the hood, DataStax Graph uses materialized views to be able to walk an edge in its reverse direction.

To demonstrate, Example 5-3 shows how to create a materialized view on the existing edge label for deposit_to.
schema.edgeLabel("deposit_to").from("Transaction").to("Account").materializedView("Transaction_Account_inv").ifNotExists().inverse().create()
schema.edgeLabel("deposit_to").from("Transaction").to("Account").materializedView("Transaction_Account_inv").ifNotExists().inverse().create()
Example 5-3 creates a table in Apache Cassandra called "Transaction_Account_inv". The partition key for this table is the acct_id. The clustering column is transaction_id.

The full primary key from Example 5-3 is written as (acct_id, transaction_id). This notation means that the full primary key contains two pieces of data: acct_id and transaction_id. The first value, acct_id, is the partition key, and the second value, transaction_id, is the clustering column.

From the user’s perspective, this gives us the ability to walk through the deposit_to edge from accounts to transactions. To convince ourselves of this, let’s see the edges that are stored between these two data structures by inspecting the data on disk.

We can inspect the edges on disk for the deposit_to edge label by querying the underlying data structures in Apache Cassandra. There are two tables to inspect. First, let’s look at the original table for Transaction_deposit_to_Account; you can do this from DataStax Studio with the following (the results are shown in Table 5-2):
select * from "Transaction_deposit_to_Account";
| Transaction_transaction_id | acct_id |
|---|---|
| 220 | acct_14 |
| 221 | acct_14 |
| 222 | acct_0 |
| 223 | acct_5 |
| 224 | acct_0 |
The following query shows how to list all edges on disk for the materialized view of the deposit_to edge label, and Table 5-3 displays the results:
select * from "Transaction_Account_inv";
| acct_id | Transaction_transaction_id |
|---|---|
| acct_0 | 222 |
| acct_0 | 224 |
| acct_5 | 223 |
| acct_14 | 220 |
| acct_14 | 221 |
Let’s look very closely at the differences between Table 5-2 and Table 5-3. The easiest one to spot is the transaction involving acct_5. In Table 5-2, we see that the partition key for this edge is out_transaction_id, which is 223. The clustering column is in_acct_id, which is acct_5.

Examine how this same edge is stored in Table 5-3, the materialized view of Table 5-2. We can see that the edge’s keys are flipped; the partition key for this edge is in_acct_id, which is acct_5, and the clustering column is out_transaction_id, which is 223. We now have bidirectional edges to use in our example.
We just walked through all of the technical explanations for topics in Apache Cassandra that we have planned for this book. Our explanations of the technical concepts are intentionally only a surface-level introduction to the internals of Apache Cassandra, presented from the perspective of a graph application engineer. There is much more to understand about partition keys, clustering columns, materialized views, and more within distributed systems.

We encourage you to go deeper and can recommend two other resources to get you there.

First, for a deep dive on the internals of Apache Cassandra, consider picking up a different O’Reilly book: Cassandra: The Definitive Guide, Third Edition by Jeff Carpenter and Eben Hewitt (O’Reilly).

Or for a complete examination of the internals of distributed systems, check out Alex Petrov’s Database Internals: A Deep Dive into How Distributed Data Systems Work (O’Reilly).

We are coming back up from the internals of distributed graph data for one last pass of our C360 example. Applying the concepts we have discussed can give us more data modeling recommendations, schema optimizations, and a few new ways to implement our Gremlin queries. So that is exactly where we are going.
The upcoming section applies our knowledge of keys and views in Apache Cassandra to data modeling best practices with DataStax Graph.

The new knowledge of the layout of vertices and edges in DataStax Graph opens up more data modeling optimizations. Let’s apply our understanding of partition keys, clustering columns, and materialized views and visit our second set of data modeling recommendations (picking up from the six recommendations provided in Chapter 4).

To begin with, let’s recall where our graph schema left off—see Figure 5-13.

This brings us to our next data modeling recommendation.

Properties can be duplicated onto edges or vertices; use denormalization to reduce the number of elements you have to process in a query.
To apply this tip, consider a case in which an account has thousands of transactions. When we want to find the most recent 20 transactions, we need to walk from the account vertex through all of its transactions before we can subselect the vertices by time. It is pretty expensive to traverse all of the edges to access all transactions and then sort the transaction vertices.
Can we be smarter and reduce the amount of data we have to process?

We can. Specifically, we can store a transaction’s time in two places: on the transaction vertex and on the edges. This way, we can subselect the edges to limit our traversal to only the most recent 20 edges. Figure 5-14 illustrates duplicating time onto an edge label.
For simplicity’s sake, in Figure 5-14 we show only the addition of a timestamp to the withdraw_from edge; we will apply the same technique for the deposit_to and charge edge labels.

This type of optimization requires your application to write the same timestamp onto both the edge and the vertex. This is called denormalization.
Denormalization is the strategy of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data grouped differently.
Duplicating properties, or denormalization, is a very popular strategy that balances the dualities between unlimited query flexibility and query performance. On one hand, modeling your data in a graph database allows for more flexibility and easier integration of data sources. This flexibility is one of the main reasons teams are picking up graph technologies; graph technology inherently integrates more expressive modeling and query languages.

On the other hand, poor planning during development has left many teams before you with unrealistic expectations for their production graph model. They focused more on data model flexibility at the expense of query performance. Your queries will be more performant if you take advantage of modeling tricks like denormalization.

Before you start adding properties and materialized views to all of your edges, consider our next recommendation.

Let the direction you want to walk through your edge labels determine the indexes you need on an edge label in your graph schema.

With this tip, we are asking you to do a few things. First, we are advising you to work out your Gremlin queries in development mode first, just like we did in Chapter 4. Then we can apply those final queries to determine only the materialized views that you need. You don’t need indexes for everything.

There are two ways to do this in DataStax Graph: you can do it yourself, or you can tell the system to do it.
Let’s start with what it would look like if you were to figure out indexes on your own.
To recognize when you need an index, you have to map your Gremlin query onto your graph schema. Mapping a query onto schema is something we’ve been mentally practicing throughout this book, but let’s see what this looks like drawn out in Figure 5-15. We will draw out our first query’s steps in our schema from start to end. Then, we use the query steps overlaid on our schema to identify where we will need an edge index. Figure 5-15 depicts a query’s steps drawn over schema followed by Example 5-4, which shows the Gremlin query.
1dev.V().has("Customer","customer_id","customer_0").// [START]2out("owns").// [1 & 2]3in("withdraw_from","deposit_to").// [3]4order().// [3]5by(values("timestamp"),desc).// [3]6limit(20).// [3]7values("transaction_id")// [END]
1dev.V().has("Customer","customer_id","customer_0").// [START]2out("owns").// [1 & 2]3in("withdraw_from","deposit_to").// [3]4order().// [3]5by(values("timestamp"),desc).// [3]6limit(20).// [3]7values("transaction_id")// [END]
Let’s break down what we are showing in Figure 5-15 alongside Example 5-4. We mapped each step of the query to the schema that you walk through during the query. The boxes labeled from Start to End map a green path through the schema elements to match the query’s steps to where we are walking throughout our schema.

The walk through our schema can be thought of as follows. We begin the traversal by uniquely identifying a customer, shown in the query and schema with the Start box. This is line 1 in our query. Then we use the owns edge to access that customer’s account; this is shown in the boxes labeled 1 and 2. This is line 2 in our query. Box 3 maps together the processing and sorting of transactions. This maps to lines 3, 4, 5, and 6 in our query. End labels where the traversal stops, on line 7 of our query.

The most important concept in Figure 5-15 is at step 3. The query walks through the incoming withdraw_from and deposit_to edge labels to access the Transaction vertex label. However, we are walking against the direction of these edge labels in our schema. We highlighted this in Figure 5-15 with orange dotted lines.

Being able to mentally see that we are walking against the direction of an edge label identifies where you need a materialized view in your graph. This is a very important concept that we hope you followed from Figure 5-15 alongside Example 5-4. We think of this last example as one of the most fundamental aha moments for understanding graph data in Apache Cassandra, and we hope you got there.

If juggling all of this in your head is new or does not feel natural, there is another way: you can let DataStax Graph do it for you.

DataStax Graph has an intelligent index recommendation system called indexFor. To let the index analyzer figure out what indexes a particular traversal requires, all you need to do is execute schema.indexFor(<your_traversal>).analyze() using the query we walked through in Figure 5-15:
schema.indexFor(
  g.V().has("Customer","customer_id","customer_0").
    out("owns").
    in("withdraw_from","deposit_to").
    order().
      by(values("timestamp"), desc).
    limit(20).
    values("transaction_id")
).analyze()
Because we already created a materialized view for deposit_to, this command will output only one recommendation. The output contains the following information, reformatted here to make it easier to read:
Traversal requires that the following indexes are created:

schema.edgeLabel("withdraw_from").
  from("Transaction").
  to("Account").
  materializedView("Transaction__withdraw_from__Account_by_Account_acct_id").
  ifNotExists().
  inverse().
  create()
Essentially, Figure 5-15 and indexFor(<your_traversal>).analyze() are doing the same thing. They are mapping your traversal onto your schema to see where you need a materialized view.

After you develop all of your queries, as we did in Chapter 4, you can use either technique to figure out where you will need indexes in your production schema. The manual approach can be useful for figuring out the default direction you should use for an edge label. If you only use indexFor(…).analyze(), you could end up with a bunch of indexes that may not be needed if some of the edges are simply turned around.
The next recommendation is for when you are first setting up your production database.
加载您的数据;然后应用您的索引。
Load your data; then apply your indexes.
我们建议在应用索引之前加载数据,因为这将显著加快您的数据加载过程。此建议的应用取决于您团队的部署策略。
We recommend loading data before applying indexes because this will significantly speed up your data loading process. The application of this recommendation depends on your team’s deployment strategy.
由于生产图数据库的蓝绿部署模式很流行,这是一种常见的加载策略。如果这是您想要使用的模式类型,我们建议加载数据然后应用索引。有关最大限度减少系统停机时间的部署策略资源(如蓝绿模式),我们推荐Jez Humble 和 David Farley(Addison-Wesley)撰写的《持续交付:通过构建、测试和部署自动化实现可靠的软件发布》 。
This is a common loading strategy because of the popularity of blue-green deployment patterns for production graph databases. If this is the type of pattern you would like to use, we recommend loading data and then applying indexes. For a resource on deployment strategies to minimize system downtime, like the blue-green pattern, we recommend Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation by Jez Humble and David Farley (Addison-Wesley).
最后还有一条建议要推荐。
There is one last tip to recommend.
仅保留生产查询所需的边和索引。
Keep only the edges and indexes that you need for your production queries.
在开发和生产之间,您可能会发现遍历不需要的边标签。这是意料之中的。将架构移至生产环境时,请删除您不会使用的边标签。节省一些磁盘空间和保存它所花费的时间。
Between development and production, you may find edge labels that you do not need for your traversals. That is expected. When you move your schema into production, get rid of the edge labels you are not going to use. Save some space on disk and the time spent persisting it.
让我们将刚刚介绍的新数据建模建议应用到第 4 章中建立的开发模式中。这将是我们最后一次使用此示例和示例数据,之后我们将在后续章节中讨论不同的图模型。
Let’s apply the new data modeling recommendations we just covered to the development schema we built up in Chapter 4. This will be the last time we use this example and sample data before we move into different graph models in future chapters.
本节中其余的实施细节代表我们的 C360 示例的最终生产版本。
The remaining implementation details in this section represent the final production version of our C360 example.
首先,我们将向 C360 示例的架构添加所需的物化视图。然后,我们将介绍如何使用 DataStax Bulk Loader 加载数据。最后,我们将重新审视并更新 Gremlin 查询以使用新的优化。
First, we will add the required materialized views to the schema for our C360 example. Then we will go through an introduction of how to load data with DataStax Bulk Loader. Last, we will revisit and update our Gremlin queries to use the new optimizations.
我们需要对我们的开发模式做一些修改。首先,我们要找到在边缘上增加时间可以减少查询中需要处理的数据量的区域。
We have a few changes to make to our development schema. First, we want to find areas where adding time onto our edges will reduce the amount of data we need to process in a query.
Let’s visualize this in Figure 5-16 for the second query of our example. Figure 5-16 steps through the Gremlin query.
```
dev.V().has("Customer", "customer_id", "customer_0").  // Start
  out("uses").                           // 1
  in("charge").                          // 2
  has("timestamp",                       // 2
      between("2020-12-01T00:00:00Z",    // 2
              "2021-01-01T00:00:00Z")).  // 2
  out("pay").                            // 3
  groupCount().                          // End
    by("vendor_name").                   // End
  order(local).                          // End
    by(values, decr)                     // End
```
Comparing Figure 5-16 with Example 5-5 illustrates two production schema strategies. First, we can apply denormalization to optimize this query. Currently, time is stored only on the Transaction vertex. We can reduce the number of edges required in this traversal if we denormalize the timestamp property and store it on the charge edge. This is illustrated in Figure 5-16 and Example 5-5 with the label 2.
We also see in Figure 5-16 that our query walks against the direction of the charge edge. This means we need another materialized view on this edge label. The schema code is:
```
schema.edgeLabel("charge").
  from("Transaction").
  to("CreditCard").
  materializedView("Transaction_charge_CreditCard_inv").
  ifNotExists().
  inverse().
  create()
```
Following this same style of mapping, we can find three edge labels where denormalization can optimize our queries. This optimization minimizes the amount of data a traversal has to process by sorting the edges on disk. Specifically, we can minimize the amount of data required to process our traversals if we also add the timestamp property to the withdraw_from, deposit_to, and charge edge labels.
We have been exploring through schema, queries, and data integration to iteratively introduce and build up our C360 example. Together, the technical concepts and previous discussions bring us to the final production schema for our C360 example shown in Figure 5-17.
The adjustment we applied here is to denormalize and add timestamp onto the edge labels that we use in our traversals.
The final version of the schema code for our edge labels is shown in Example 5-6.
```
schema.edgeLabel("withdraw_from").
  ifNotExists().
  from("Transaction").
  to("Account").
  clusterBy("timestamp", Text).  // sort the edges by time
  create();

schema.edgeLabel("deposit_to").
  ifNotExists().
  from("Transaction").
  to("Account").
  clusterBy("timestamp", Text).  // sort the edges by time
  create();

schema.edgeLabel("charge").
  ifNotExists().
  from("Transaction").
  to("CreditCard").
  clusterBy("timestamp", Text).  // sort the edges by time
  create();
```
To make the examples easier to follow in this book, we use Text to represent time and then query with strings such as 2020-12-01T00:00:00Z. The timestamp property type uses less space on disk than Text and may be the best option for your final application.
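For example, a variant of the charge edge label that clusters by a native timestamp instead of a string might look like the following sketch; it assumes you also convert the CSV values to timestamps during loading:

```
// A sketch using the native Timestamp type instead of Text;
// the edges are still sorted on disk by time
schema.edgeLabel("charge").
  ifNotExists().
  from("Transaction").
  to("CreditCard").
  clusterBy("timestamp", Timestamp).
  create();
```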
Altogether, we need only the following changes from our development schema to our production schema:
Denormalize a property onto five edge labels
Add three materialized views to walk three edges in reverse
Let’s detail how to use a bulk loading tool to insert the data into your graph database.
We created a script that loads all of the data into DataStax Graph from CSV files. DataStax Bulk Loader is the fastest way to load data in production. We provided a CSV file for each vertex and edge label from our data model. Let’s walk through the general process for loading vertices and then show the same for edges.
Let’s look at all of the included vertex datafiles and a brief description for each file in Table 5-4.
| Vertex file | Description |
|---|---|
| Accounts.csv | The account IDs, one per line |
| CreditCards.csv | The credit card IDs, one per line |
| Customers.csv | Customer details, one per line |
| Loans.csv | The loan IDs, one per line |
| Transactions.csv | Transaction details, one per line |
| Vendors.csv | Vendor details, one per line |
Let’s see an example of how to load vertex data with DataStax Bulk Loader by examining Transactions.csv. The first five lines of Transactions.csv are shown in Table 5-5. Each line contains three pieces of information about the transaction that map to our expected schema. You also see in Table 5-5 that all transactions are loaded with an unknown type because one of our traversals is to mutate this property according to the graph’s structure.
| transaction_id | timestamp | transaction_type |
|---|---|---|
| 219 | 2020-11-10T01:00:00Z | unknown |
| 23 | 2020-12-02T01:00:00Z | unknown |
| 114 | 2019-06-16T01:00:00Z | unknown |
| 53 | 2020-06-05T01:00:00Z | unknown |
The most important line in Table 5-5 is the header. In the accompanying loading scripts, the header doubles as the mapping configuration between the file and the database. The header and the property names in DataStax Graph must match.
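In other words, the raw file behind Table 5-5 begins with the property names themselves, roughly like this:

```
transaction_id,timestamp,transaction_type
219,2020-11-10T01:00:00Z,unknown
23,2020-12-02T01:00:00Z,unknown
...
```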
We can load the CSV file using the command-line bulk loading utility, as shown in Example 5-7.
```
1  dsbulk load -url /path/to/Transactions.csv
2         -g neighborhoods_prod
3         -v Transaction
4         -header true
```
Example 5-7 shows the most basic way to load vertex data on your localhost. The first part of line 1, dsbulk load, invokes the loading tool from the command line. The next four parameters, which can come in any order, are -url, -g, -v, and -header:
- The -url parameter indicates where the CSV is stored.
- -g is the name of the graph.
- -v is the vertex label.
- -header specifies that the data should be mapped according to the file's header.
The DataStax dsbulk documentation contains all the details for other loading options, including loading into a distributed cluster, configuration files, and much more.
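For instance, pointing the same load at a remote cluster is mostly a matter of adding connection parameters. The following is a sketch with a placeholder address; check the dsbulk documentation for the full set of options:

```
dsbulk load -url /path/to/Transactions.csv
       -g neighborhoods_prod
       -v Transaction
       -header true
       -h '10.200.100.34'
```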
Next, let’s take a look at the edge data and loading process.
All of the included edge datafiles and a brief description for each are listed in Table 5-6.
| Edge file | Description |
|---|---|
| charge.csv | The charge edges, one per line |
| deposit_to.csv | The deposit_to edges, one per line |
| owes.csv | The owes edges, one per line |
| owns.csv | The owns edges, one per line |
| pay_loan.csv | The pay edges into Loan vertices, one per line |
| pay_vendor.csv | The pay edges into Vendor vertices, one per line |
| uses.csv | The uses edges, one per line |
| withdraw_from.csv | The withdraw_from edges, one per line |
Let’s see an example of how to load edge data with DataStax Bulk Loader by examining deposit_to.csv. The first five lines of deposit_to.csv are shown in Table 5-7. Each line contains three pieces of information about the deposit that map to our schema: the transaction_id, the acct_id, and a timestamp.
| Transaction_transaction_id | Account_acct_id | timestamp |
|---|---|---|
| 185 | acct_5 | 2020-01-19T01:00:00Z |
| 251 | acct_5 | 2020-07-25T01:00:00Z |
| 247 | acct_5 | 2020-03-06T01:00:00Z |
| 214 | acct_14 | 2020-06-11T01:00:00Z |
The most important line in Table 5-7 is the header; the header has to match the table schema in DataStax Graph. DataStax Graph autogenerates different column names for the edge properties that are part of the table’s primary key. The generated name prepends the vertex label to the property name, such as Transaction_ in front of transaction_id and Account_ in front of acct_id.
We can load the edge CSV file using the command-line bulk loading utility, as shown in Example 5-8.
```
1  dsbulk load -url /path/to/deposit_to.csv
2         -g neighborhoods_prod
3         -e deposit_to
4         -from Transaction
5         -to Account
6         -header true
```
Example 5-8 shows the most basic way to load edge data on your localhost. The first part of line 1, dsbulk load, invokes the loading tool from the command line, as we saw in the previous example. The next six parameters can come in any order: -url, -g, -e, -to, -from, and -header. The -url parameter indicates where the CSV is stored, -g is the name of the graph, -e is the edge label, -from is the outgoing vertex label, -to is the incoming vertex label, and -header says to map the data according to the file’s header.
The accompanying scripts show how to load all vertex and edge labels for this chapter and all examples in this book. Please head to the data directory within the book’s GitHub repository for the data and loading scripts for each chapter.
You will see many more examples of bulk loading data into DataStax Graph throughout the rest of the book. For now, let’s move on to the next stage of our implementation details: querying our graph with Gremlin.
Now that we have updated our edge labels and indexes, let’s revisit the queries and the results for each query. These are the same queries we walked through in Chapter 4, but there are two changes. First, we now can use the production traversal source g. We have moved out of development mode into writing queries against a production application. Second, we are going to update each query to use our new production schema. We will be using time on edges in addition to the materialized view.
Let’s start by revisiting Query 1.
All of the work we did to set up the schema and graph data empowers the simplicity of the query in Example 5-9 to answer our first question.
```
g.V().has("Customer", "customer_id", "customer_0").
  out("owns").
  inE("withdraw_from", "deposit_to").  // uses materialized view on deposit_to
  order().                             // sort the edges
    by("timestamp", desc).             // by time
  limit(20).                           // walk through the 20 most recent edges
  outV().                              // walk to the transaction vertices
  values("transaction_id")             // get the transaction_ids
```
The results remain the same, but the query processed less data by sorting the edges:

```
"184","244","268",...
```
The main change from the query in Chapter 4 to this example can be seen in the addition of a single character: E. The query changed from using in() to inE(). This one-character change takes advantage of a materialized view and the sorted order of edges.
To dig into the details, let’s recall how we walked through this data in development mode. In Chapter 4, the in() step walked directly through edges, to the vertices, ignoring the edges’ direction, and then sorted the vertex objects. That was simple enough for figuring out how to walk through our graph data.
In a production environment, we would need to ensure that this query processes only the data it needs. In Example 5-9, we optimized this query by using inE(), sorting all edges by time, and traversing only the 20 most recent edges.
The sorting of all edges requires three concepts from our schema. First, we use the materialized views we built on the deposit_to and withdraw_from edge labels. Second, we use the clustering key for deposit_to because the edges are ordered on disk by time. And last, we use the clustering key for the withdraw_from edge label because these edges are also ordered on disk by time.
That is a significant amount of optimization from just a small change: from in() to inE(). Let’s look at what we need to do to our next query to take advantage of our new schema.
We are going to apply the same pattern to optimize our next query. We want to take advantage of the denormalization of time on the charge edge to minimize the amount of data we need to process. In Gremlin, this looks like Example 5-10.
```
g.V().has("Customer", "customer_id", "customer_0").
  out("uses").
  inE("charge").                          // access edges
  has("timestamp",                        // sort edges
      between("2020-12-01T00:00:00Z",     // beginning of December 2020
              "2021-01-01T00:00:00Z")).   // end of December 2020
  outV().                                 // traverse to transactions
  out("pay").hasLabel("Vendor").          // traverse to vendors
  groupCount().
    by("vendor_name").
  order(local).
    by(values, desc)
```
The results are the same as before:

```
{"Target":"3","Nike":"2","Amazon":"1"}
```
The change and optimization we applied in Example 5-10 follow the same pattern as Example 5-9. This time, we used inE() to access only incoming edges. We used the clustering key timestamp to apply a range function to the edges. Once we found all edges in a certain range, we moved to the transaction vertices and continued our traversal, as in Chapter 4.
This brings us to our last query from Chapter 4.
Let’s think about the data this query is processing before we look at the final version of the query. In this query, we are starting from Aaliyah and finding all withdrawals from her accounts. There are no limits or time constraints for this query; we want to find them all. This means that we will not be using any time ranges on the edges.
Further, every step along this query uses an existing outgoing edge label. Because of this, we do not need any materialized views and can walk out the existing edges to satisfy this query. Therefore, we need only to switch to our production traversal source, and this query will be ready to go—see Example 5-11.
```
g.V().has("Customer", "customer_id", "customer_4").  // accessing Aaliyah's vertex
  out("owns").                                       // walking to the account
  in("withdraw_from").                               // only consider withdraws
  filter(out("pay").                                 // walking out to loans or vendors
         has("Loan", "loan_id", "loan_18")).         // only keep loan_18
  property("transaction_type",                       // mutating step: set the "transaction_type"
           "mortgage_payment").                      // to "mortgage_payment"
  values("transaction_id", "transaction_type")       // return the id and type
```
The results look exactly the same as those in Chapter 4:

```
"144","mortgage_payment",
"153","mortgage_payment",
"132","mortgage_payment",
...
```
With Example 5-11, we have concluded the transformation from development to our production schema and queries. We encourage you to apply the thought process of shaping query results from “Advanced Gremlin: Shaping Your Query Results” to create more robust payloads and data structures to share within your application.
We consider the transition from Chapter 4 to the topics and production optimizations presented in this chapter to be the final stage of learning how to work with graph data in Apache Cassandra. Along the way, you experienced limitations, followed by their resolutions. We will see more of that as we go along but in shorter iterations.
Throughout Chapter 4, we presented data modeling tips for mapping your data into a distributed graph database. In this chapter, we augmented those tips with specific ways to optimize your production graph database. Let’s revisit all 10 tips to recall the journey we went through from development to production (Figure 5-18).
These 10 tips are foundational to starting over with a new dataset and use case. We will be applying them repeatedly in the coming chapters. And we will find more recommendations to add to this list as we explore different common structures for distributed graph applications.
From here, we think you are ready to tackle deeper and more complex graph problems such as paths, recursive walks, collaborative filtering, and more.
The most advanced graph users today are those who are willing to learn through trial and error. We have collected what they have learned so far and will be walking you through those details within the context of new use cases in the coming chapters.
As we see it, gaining traction with new technology and new ways of thinking is a journey. We have presented the major foundational milestones others have reached so far. Now, you are ready to come along with us and apply graph thinking in production applications to solve complex problems.
In Chapter 6, we’ll look at one of the most popular ways for people to extend graph thinking into their data. We will solve a complex problem found at the intersection of edge computing and hierarchical graph data in a self-organizing communication network of sensors.
C360 applications for neighborhood exploration are the most popular use of distributed graph technology at this time. A C360 example also serves as a great introduction to a plethora of concepts in distributed systems, graph theory, and functional query languages.
But what else is out there?
In the next two chapters, we step beyond understanding neighborhoods of data and apply graph thinking to hierarchical data.
Hierarchical data represents concepts that naturally organize into a nested structure of dependencies.
At the time of writing this chapter, hierarchically structured data is the second most popular shape of data used in distributed graph applications.
There are five main sections to this chapter.
The first section walks through multiple examples of hierarchical data from real-world scenarios. With a new shape of data comes another flood of terminology; the second section introduces new terms with many examples. The third section of the chapter introduces the problem statement, data, and schema we will use in our examples. With our data, there are two main styles of queries for working with hierarchical data. The fourth section explains the first query pattern: walking from the bottom of the hierarchy to the top. The last section shows the second query pattern: walking from the top of the hierarchy to the bottom.
The final query pattern in the last section unveils one of the most difficult aspects of working with deeply nested data in a production application. We end this chapter showing how things can break, setting the stage for Chapter 7, in which we explain why and how to fix them for production.
More often than not, we already use graphs to describe the natural, nested structure within concepts we use every day. We often see hierarchical structure within the data about a product’s structure, version control systems, or people. Let’s dive into each of these three examples and illustrate how we reason about nested data with a graph.
The first place to explore natural hierarchies within data can be seen in any bill of materials (BOM) application. A BOM application describes a product’s structure by associating the nested dependencies of the raw materials, assemblies, parts, and quantities needed to create a product in an end-to-end pipeline. Figure 6-1 illustrates the dependencies for constructing a Boeing 737 airplane.
You can see the natural hierarchy or “nestedness” of data when you consider the BOM required to build an airplane. Consider this question: how many screws are used to construct a Boeing 737? The answer can be found by walking through the hierarchy of components that are assembled to construct a plane: a plane has two wings, each wing has one turbine engine, the engine has a shaft that requires 12 screws, and so on.
When we talk about hierarchies in a BOM, we are talking about following that same deconstruction for every part of the plane to figure out the total number of screws it takes to build the whole object. This type of hierarchy in data exists for manufacturing plants, assembly lines, and myriad areas within industrial engineering.
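To make the screw question concrete, here is a hedged Gremlin sketch. The Part label, contains edge, and quantity property are hypothetical, since this BOM graph is not part of our dataset:

```
// Multiply the quantities along every path from the plane down to the
// screws, then sum the products to get the total screw count
g.withSack(1).
  V().has("Part", "part_name", "Boeing 737").
  repeat(outE("contains").
           sack(mult).by("quantity").   // multiply the running total by each edge's quantity
         inV()).
  until(has("Part", "part_name", "screw")).
  sack().
  sum()
```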
You also find hierarchies and graph data structures in software engineering processes. The most popular one, and the one used to supplement this book with technical content, is Git.
Git’s version control system forms a hierarchy. You can think of this version control system as containing three separate tree structures: the working directory, the index, and the head. Each tree in the version control system has a different and specific purpose: writing, staging, or committing changes. To illustrate this, Figure 6-2 shows how a dependency graph for your project is observable between each state of changes.
You can also think of Git as a chain. In this light, the version control system creates a chain of dependencies with forks. Either way you prefer to think of it, the shape of data within Git’s version control system forms a nested hierarchy.
Let’s look at a third example where we find hierarchical structure in data.
The last example of natural hierarchies can be found in how people self-organize. There are two main examples of this: family trees and corporate hierarchies. To really bring home hierarchies and their relationships to graph data structures, think about your own family. Think back as far as you can, maybe to a great-great-grandparent. Tracing your family’s lineage from a long-ago ancestor to you forms a hierarchy of parents and children across many levels. The parent-child dependency within a family is one of the best examples of hierarchy in natural data.
We create the same type of organization within our workforces; an example corporate hierarchy is shown in Figure 6-3.
Corporate structures look somewhat similar to family trees. The manager-employee relationship is the same as your family’s parent-child relationship. We work in groups and organize ourselves in the same structure as our lineage. Broadly speaking, a CEO has a team of vice presidents, each vice president has a team of directors, and directors manage teams of individual contributors.
It is great to realize how we already use nested relationships to describe common concepts, but let’s explore why this shape of data is currently the second most popular use of graph technology.
Graph technology enables a more natural way to represent the nested relationships within data. The more natural representation of data yields simpler code to maintain and makes development teams more productive.
For example, during one of the many conversations we had with graph users around the world for this book, we found an early adopter who told us that his team “translated 150 lines of a query on top of HBase into 20 lines of Gremlin.” This is exactly why engineering teams are adopting graph technology to model, store, and query hierarchically structured data.
The simplification to the codebase, and the resulting enhancement to developer productivity, has been a common theme in our conversations with users. This is encouraging more teams to use distributed graph technology to model, reason, and solve complex problems with natural hierarchies.
So what do corporate structures, version control, and product structures have in common?
When we look at the data for each of these concepts, we see nested or hierarchical data. When using graph technologies, these hierarchies are called trees.
To lay the foundation for what we see, let’s take a tour of a new wave of graph terms so that we can teach you how to see the trees within this forest of data.
The definitions throughout this section bring together terminology from the database and graph-theoretic communities. Concepts about the data’s storage model, like hierarchy, are popular terms about databases. Terms that define observable structures within the data, such as tree and forest, originate in graph theory.
Where the terms come from does not matter. Being able to distinguish between concepts related to storage versus those related to sample data does matter. We already ran into how easy it can be to confuse concepts from graph data and graph schema in Chapter 2. We see the same confusion again with hierarchical data. The constant mixture of terminology from multiple communities explains why graph technology can be difficult to pick up.
To help you navigate both worlds, let’s look at some examples that can put a picture to some key terms.
We have used the term tree a few times without defining it. Let’s do that now.
A tree is a graph in which any two vertices are connected by exactly one path; equivalently, it is a connected graph that contains no cycles.
We will formally define a cycle in the next section. For now, let’s revisit our example corporate hierarchy to see trees in practice. The graph in Figure 6-4, from the CEO down to the software engineers, forms a tree.
Examining the edges in Figure 6-4 shows that every vertex has only one edge pointing to it. If you modeled your company’s corporate tree and compared its structure to your competitor’s corporate tree, you would be looking at two separate trees. Those two trees together make a forest. Yes, mathematicians had a bit of fun when coming up with these official graph theory terms; let the puns begin.
There are two special types of vertices within hierarchical data: parents and children.
A parent vertex is one step higher in the hierarchy.
A child vertex is one step below a parent in the hierarchy.
You can identify examples of these terms in Figure 6-4. The VP of Product in Figure 6-4 is the parent of the Director of Marketing. The Director of Marketing is a child vertex of the VP of Product.
The following definitions explain how roots and leaves fit into the traditional understanding of parent and child dependencies in hierarchical data.
A root is the topmost parent vertex; a root is the beginning of the dependency chain within a hierarchy.
A leaf is the last child vertex in a dependency chain within a hierarchy; a leaf vertex has a degree of one.
Looking at the diagram in Figure 6-4, the CEO is the root, and each software engineer is a leaf.
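Both definitions translate directly into Gremlin filters on edge direction. The following sketch assumes a hypothetical corporate graph with Person vertices and manages edges pointing from parent to child:

```
// Two separate traversals: roots have no incoming manages edges,
// and leaves have no outgoing ones
g.V().hasLabel("Person").not(inE("manages"))   // the root(s): the CEO
g.V().hasLabel("Person").not(outE("manages"))  // the leaves: the engineers
```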
Data within a hierarchy is usually referenced in one of three ways in an application: by its neighborhoods, by its depth, or by its path.
First, an application references hierarchical data according to its parents or its children. From a certain vertex, you would walk up one level to report the parent vertex, or you would walk down one level to report its children. This is very similar to walking around neighborhoods like we have been doing in the past few chapters.
Second, an application references hierarchical data according to its distance from either a root or a leaf. We use the term depth to refer to this distance in hierarchical data.
In a hierarchy, depth is the distance of any vertex in the graph to its root; the maximum depth in a tree is found from its root.
Let’s take a look at our corporate hierarchy tree to apply depth to this data.
While you have been thinking about corporate reporting structures, you probably have been considering how far each position is from the CEO. Figure 6-5 gives us a formal terminology for that natural association. Looking at the hierarchy in Figure 6-5, we say that the VP of Product is 1 away from the CEO. The Director of Engineering has a depth of 2 from the CEO. Last, a software engineer has a depth of 3 from the CEO.
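Using the same hypothetical manages edges from the earlier sketch, depth falls out of a repeat()/emit() walk: the number of edges in each emitted path is that vertex’s depth from the root. A sketch:

```
// Emit every person below the CEO along with the path that reached them;
// a path containing n edges means a depth of n from the root
g.V().has("Person", "name", "CEO").
  repeat(out("manages")).emit().
  path().by("name")
// e.g., [CEO, VP of Product, Director of Engineering] is a depth of 2
```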
The third way that hierarchical data is used in an application requires understanding the full dependency chain between two pieces of data. Accessing the full dependency chain requires traversing through the data from the root to the leaves or vice versa. This brings us to three useful terms.
A walk through a graph is a sequence of visited vertices and edges. Vertices and edges can be repeated.
A path through a graph is a sequence of visited vertices and edges. Vertices and edges cannot be repeated.
A cycle is a path where the starting and ending vertices are the same.
Let’s look at Figure 6-6, which shows an example of a path from the root to a leaf in our corporate tree.
The path in Figure 6-6 walks from the CEO through two different levels to get to a software engineer. This is a path because all data along the way is used only once. In other words, there are no repeated edges or vertices in this example path: CEO → VP of Product → Director of Engineering → Software Engineer 3.
The natural translation of hierarchical data into how we think and reason about it is exactly why teams are using graph technology. The way that we represent, store, and query hierarchical data with graph technology already follows how we think about it, naturally!
Now that we understand the terminology, let’s set up the example we will be using in the next two chapters.
If you use electricity, you likely contribute to a distributed hierarchy of data every moment of your day.
On an hourly basis, you contribute to distributed, hierarchical graph data structures by flicking a light switch in your house or business. Your power supplier tracks how much energy your home or workplace uses on a time interval, likely every 15 minutes. These readings are collected and sent back to your power company, which aggregates them.
Your power company may even distribute these readings from one power recipient to another via a self-organizing network of sensors within the power chain. The transfer of these readings through a self-organizing network is one of the most beautiful, dynamic, and hierarchical graph problems we interact with on a constant basis.
The example in this chapter models the dynamic and hierarchical network of communication found within a self-organizing network of sensors and towers, much as how voltage levels are communicated from your home to your power company.
To bring this example to life, we are asking you to think like a data engineer for a fictitious power company, Edge Energy. Your objective will be to understand, model, and query the hierarchical structure found within Edge Energy’s communication network.
We advise teams to approach any new problem like this one in three steps:
1. Understand the data.
2. Build a conceptual model using the GSL notation.
3. Create the database schema.
The next three sections follow these steps.
Each reading collected by Edge Energy at any home or business is reported for a few different compliance scenarios, like real-time auditing. One of the most complex problems the company has to prepare for is: what if one of the communication towers goes down?
To help you envision this, consider the zoomed-in snapshot of Edge Energy’s network in Figure 6-7.
Figure 6-7 shows Edge Energy’s sensors (the asterisks) and communication towers (the diamonds); we have highlighted one of the towers in orange. Ultimately, our example across the next two chapters has to answer this question: what would happen to Edge Energy’s sensor data if the orange tower went down? That is, Edge Energy wants to assess the impact of a tower’s failure on the accessibility of sensor data across the entire network so that the company can prepare for different failure scenarios.
The problem requires that we first understand a single tower. If we can understand one tower, we can understand any of the towers on the network. And the answer we get to at the end of Chapter 7 may surprise you.
Let’s walk through how a dynamic and hierarchical graph is constructed in Edge Energy’s network of sensors and towers.
In Edge Energy’s network, the sensors are responsible for two things. First, a sensor takes readings of the residence or business to which it is assigned. Second, on a time interval, every sensor communicates its reading to another available point in the network—either a nearby sensor or a tower. The objective is for every reading to eventually pass through this network to a tower and back to Edge Energy’s monitoring system.
In Figure 6-8 we have zoomed in to look at the network in a different area of Seattle.
What you won’t see in Figure 6-8 is the hierarchical nature of the data, but you will see it in how we use the data (coming up).
As we recently talked about, applications that use hierarchical data query the data in two main patterns: from the bottom up or from the top down. It is in how the communication data is used that its hierarchical structure becomes easier to see.
We are going to spend time walking through and understanding our data before we write code to query it.
The first way we want to use the data in Edge Energy’s sensor network is to understand how the data from a sensor reached a tower. Let’s take a look in Figure 6-9 at how the data from Sensor S was shared throughout the network to pass its reading to a tower.
Figure 6-9 emphasizes one traversal: from Sensor S to nearby towers over the course of an entire day. If you trace through every walk, you will find many unique ways to walk from Sensor S to any tower. Example paths include:
- S → Seattle
- S → A → FirstHill
- S → A → C → FirstHill
- S → A → C → D → FirstHill
- S → A → C → D → WestLake
To look at this in a different way, Figure 6-10 shows the hierarchical structures from Figure 6-9.
Looking at the data in Figure 6-10 illustrates the unbounded and hierarchical nature of the data. Some paths from Sensor S have a distance of 1 whereas others vary up to a distance of 5. Figure 6-11 shows how you can quickly find the distance of a path in this hierarchy.
Figure 6-11 shows that the distance from Sensor S to the Seattle tower can be 1, 3, 4, or 6. The path of length 1 is Sensor S → Seattle. The path of length 3 is Sensor S → A → B → Seattle. The path of length 4 is Sensor S → A → B → E → Seattle. The path of length 6 is Sensor S → A → B → E → F → G → Seattle.
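As a first, hedged sketch of how such paths could be enumerated in Gremlin, consider the following; it is not the chapter’s final query, and the unbounded repeat() is exactly the kind of pattern Chapter 7 revisits for production:

```
// Walk send edges from a sensor until any tower is reached, keeping
// simple paths only; simplePath() prevents revisiting a vertex
dev.V().has("Sensor", "sensor_name", "1002688").
  repeat(out("send").simplePath()).
  until(hasLabel("Tower")).
  path().
    by(coalesce(values("tower_name", "sensor_name")))
```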
Through some hierarchy, every sensor’s reading ultimately reaches a tower.
In the real world, these sensors are free to communicate with any nearby sensor or tower. This means that the hierarchical structures within our graph are dynamic and constantly changing. These dynamic networks create some of the most beautiful mixtures of time series data with graph structures in Cassandra.
Now that we understand how to see them from the bottom up, let’s reverse the direction and explore dynamic networks from the towers down to sensors.
The second way we will be querying this data is from the top down: from towers to sensors. Figure 6-12 zooms in on our example data to show the data reachable in two steps from the WestLake tower.
Figure 6-12 shows the sensors that are reachable in a walk of length 2 from the WestLake tower. Examining the edges, we see that sensors A, B, C, F, E, G, and D are in the first neighborhood of the WestLake tower. In hierarchical data, we say sensors A, B, C, F, E, G, and D have a depth of 1 from the root, WestLake. Sensors J, K, H, I, and N are in the second neighborhood of the WestLake tower. In hierarchical data, we say sensors J, K, H, I, and N have a depth of 2 from the root, WestLake.
The hierarchical structure, and each sensor’s depth, may be easier to see in Figure 6-13.
Figure 6-12 and Figure 6-13 show the same data. We are looking at how to traverse our data from the top of the hierarchy to the bottom.
One of the most important concepts to realize is that the example here represents real-world hierarchies. They are not perfect trees. These hierarchies are messy; they contain cycles.
To see that, let’s talk about how an edge is created in this dataset and in its real-world version.
The queries in the upcoming sections will be walking up and down the sensor communication hierarchies. The following rules apply to the presence of edges between sensors and towers:
- Edges start from any sensor and go to a neighboring sensor or tower.
- There can be no loops; a sensor cannot add an edge to itself.
Loops are different from cycles. A loop is an edge that starts and ends at the same vertex; a cycle is a series of edges that starts and ends at the same vertex. There may be cycles in these networks but never loops.
We apply the hierarchical network of edges in our dataset to show how Edge Energy uses the edges in its application:
- Edges chain together to create walks.
- Walks represent communication from a sensor to a tower.
- Walks start at a sensor and end at a tower, and vice versa.
At this point, we have completed step one of our three steps. We are moving on from understanding the data to query-driven data modeling.
With our example and the data it provides, we aim to gain insight into the dynamic network formed by Edge Energy’s sensors. We will want to report the paths used to share a sensor’s reading to a tower so that we can understand failure scenarios. To do that, we will focus on addressing the following queries:
1. What path did a sensor’s data follow to pass its information to a tower?
2. What sensors communicated with a specific tower?
3. What is the impact of the shutdown, loss, or general failure of a tower?
Combining our understanding of the data, the queries listed above, and data modeling recommendations from previous chapters, we arrive at a very basic database schema for our example, as shown in Figure 6-14.
Figure 6-14 applies query-driven modeling along with our data modeling best practices to arrive at a graph database schema. As we have done throughout this book, we created two vertex labels that represent the main entities of interest in this data: Sensors and Towers. To show how a sensor communicates with Edge Energy, we have an edge label called send from a Sensor vertex label to the Tower vertex label. To illustrate how sensors communicate with each other, we have a self-referencing edge label send that starts and ends with the Sensor vertex label.
Recall from Chapter 2 that self-referencing edge labels are different from loops. Self-referencing edge labels represent schema elements that start and end at the same vertex label. This is different from a loop, which is a concept in the data, not the schema. Loops are edges in data that start and end with the vertex—like an edge starting and ending at Sensor 1. We will not have loops in our data, and consequently, sensors will not be allowed to send information to themselves.
The accompanying dataset represents real towers and sensors across the broader Seattle area. For Edge Energy, this is just one small area of its global network.
Each tower in the dataset represents a real cell phone tower. Each tower has a unique identifier, a name, and a geo-location. We already saw this when we talked about the WestLake tower. The same is true for the sensors. The sensors have a unique identifier and a valid geo-location around the Seattle area. We have been using letters to identify a sensor in our examples, like Sensor A, but the identifiers in the real dataset are integers.
A new feature we have in our example is the ability to reference the geo-location of a specific vertex. We do this by creating points in the schema code:
```
schema.vertexLabel("Sensor").
  ifNotExists().
  partitionBy("sensor_name", Text).
  property("latitude", Double).
  property("longitude", Double).
  property("coordinates", Point).
  create();

schema.vertexLabel("Tower").
  ifNotExists().
  partitionBy("tower_name", Text).
  property("latitude", Double).
  property("longitude", Double).
  property("coordinates", Point).
  create();
```
There are only two edge labels that we need to create for our example. We need to model a sensor sending information to either another sensor or a tower. The schema code will be:
```
schema.edgeLabel("send").
  ifNotExists().
  from("Sensor").
  to("Sensor").
  create()

schema.edgeLabel("send").
  ifNotExists().
  from("Sensor").
  to("Tower").
  create()
```
Let’s look at all of the included vertex data files and a brief description for each in Table 6-1.
| Vertex file | Description |
|---|---|
| Sensor.csv | The sensors, one per line |
| Tower.csv | The towers, one per line |
Let’s see an example of how to load vertex data with DataStax Bulk Loader by examining Tower.csv. The first five lines of Tower.csv are shown in Table 6-2.
| tower_name | coordinates | latitude | longitude |
|---|---|---|---|
| Renton | POINT (-122.203199 47.47896) | 47.47895812988281 | -122.20320129394 |
| MapleLeaf | POINT (-122.322603 47.69395) | 47.69395065307617 | -122.32260131835 |
| MountainlakeTerrace | POINT (-122.306926 47.791277) | 47.79127883911133 | -122.30692291259 |
| Lynnwood | POINT (-122.308106 47.828134) | 47.82813262939453 | -122.30810546875 |
In the accompanying loading scripts, the header doubles as the mapping configuration between the file and the database. The header and the property names in DataStax Graph must match.
We can load the CSV file using the command-line bulk loading utility as shown in Example 6-1.
```
dsbulk load -url /path/to/Tower.csv
       -g tree_dev
       -v Tower
       -header true
```
Example 6-1 shows the most basic way to load vertex data on your localhost, just like we did in Chapter 5. Next, let’s look at the edge data and loading process.
All of the included edge datafiles and a brief description for each are listed in Table 6-3.
| Edge file | Description |
|---|---|
| Sensor_send_Sensor.csv | The send edges from Sensor to Sensor, one per line |
| Sensor_send_Tower.csv | The send edges from Sensor to Tower, one per line |
Let’s see an example of how to load edge data with DataStax Bulk Loader by examining Sensor_send_Sensor.csv. The first five lines of Sensor_send_Sensor.csv are shown in Table 6-4.
| out_sensor_name | timestep | in_sensor_name |
|---|---|---|
| 103318117 | 1 | 126951211 |
| 1064041 | 2 | 1307588 |
| 1035508 | 2 | 1307588 |
| 1282094 | 1 | 1031441 |
The most important line in Table 6-4 is the header; the header has to match the table schema in DataStax Graph. DataStax Graph autogenerates different column names for the edge properties that are part of the table’s primary key. The header line in Table 6-4 shows how DataStax Graph appends out_ and in_ to the front of the partition key columns in the case of a self-referencing edge. If you would like to discover this on your own, you can use your schema tools inside of DataStax Studio or cqlsh to inspect the naming conventions of your schema.
You also see a property called timestep in Table 6-4, but our schema does not have this property on our edges in the database. In this case, the extra data will be ignored during the loading process; we will not end up with timestep on our edges even though it is in the data.
We will revisit and use the timestep property in Chapter 7 when we introduce how to apply time to our data and how to use it in your traversals. To add in all of that complexity now is too much for what we want to cover at this point in the development of this example.
We can load the edge CSV file using the command-line bulk loading utility as shown in Example 6-2.
```
dsbulk load -url /path/to/Sensor_send_Sensor.csv
       -g trees_dev
       -e send
       -from Sensor
       -to Sensor
       -header true
```
Example 6-2 shows the most basic way to load edge data on your localhost, as we saw in Chapter 5. The accompanying scripts show how to load all vertex and edge data for this chapter and all examples in this book. Please refer to the data directory within this book’s GitHub repository for the data and loading scripts for each chapter.
So far, we have accomplished three tasks for our example in this chapter. We explored the data we will be using for this example. Then, we built a model for sensors and towers to trace communication throughout a network of sensors. Last, we loaded the data to use for our upcoming queries.
In graph applications, querying and using tree structures primarily focuses on traversing up and down the tree’s structure. When we say we are traversing up the tree structure, we are talking about walking up from a leaf to the root. Traversing down the tree structure goes in the opposite direction: from the root down to a leaf or leaves.
Let’s iron out the concepts and queries by walking up and down the sensor trees in development mode. We will start with showing how Edge Energy can follow a sensor’s communication path to a tower by walking up the trees.
We will unveil the reason one way is harder than the other at the end of this chapter, setting the stage for Chapter 7.
The upcoming examples apply the data model to answer the queries for Edge Energy. Our first question queries the data from the leaves up to the root to answer the following:
What path did a sensor’s data follow to pass its information to a tower?
We are breaking this question down into two steps:
Where has a specific sensor sent information to?
What was this sensor’s path to any tower?
Answering each of these questions builds up to showing how to query from leaves to roots in hierarchical data. Let’s dive in and see how to do this with Gremlin.
This first query asks to explore the neighborhoods of data accessible from a given sensor. We picked Sensor 1002688 for this example. We want to start with understanding the first neighborhood; Example 6-3 shows the query and Example 6-4 displays the results.
The step dev.V(vertex) compiles to the same query as dev.V().hasLabel(label).has(key, value).has(key, value)… and so on. A has() clause is required for every property in the vertex’s primary key.
```
1  sensor = dev.V().has("Sensor", "sensor_name", "1002688"). // look up the sensor
2           next()                                           // return the sensor vertex
3  dev.V(sensor).                        // look up the sensor
4    out("send").                        // walk through all send edges
5    project("Label", "Name").           // for each vertex, create map with two keys
6    by(label).                          // the value for the first key "Label"
7    by(coalesce(values("tower_name"),   // for the 2nd key "Name": if a tower
8                values("sensor_name"))) // else, return the sensor_name
```
{"Label":"Sensor","Name":"1035508"},{"Label":"Tower","Name":"Georgetown"}
{"Label":"Sensor","Name":"1035508"},{"Label":"Tower","Name":"Georgetown"}
Example 6-3 and Example 6-4 explore the first neighborhood for Sensor 1002688. Lines 1 through 3 illustrate another way to access and use vertex objects with DataStax Graph. Lines 4 through 8 query the first neighborhood and shape the result set. The results show that 1002688 sent data to one sensor and one tower: 1035508 and Georgetown. This means that Sensor 1002688 is nearby and communicated with Sensor 1035508 and the Georgetown tower throughout the entire scope of the sample data.
Line 3 in Example 6-3 introduces one new concept: direct vertex lookup with the V(vertex) syntax. We did this to show how to store an object in your application’s memory and use it in a traversal; it might be useful for you at some point in your application’s development.
If you feel comfortable with applying these steps and shaping the query results, you can skip ahead to the next query.
For practice, let’s walk through the shaping process seen in Example 6-3. At the end of line 3, our traverser is on the vertex for Sensor 1002688. Then we walk through all outgoing send edges to arrive at any vertex in this sensor’s first neighborhood on line 4. The trick here is that a sensor can send information to other sensors or towers. Therefore, we have to prepare for different types of data to process with branching logic in Gremlin.
We would like the result payload to be structured JSON with the following keys: Label and Name. We create this JSON object and its keys with the project("Label", "Name") step. Line 6 fills the Label keys in our map with each vertex’s label via the label() step within a by() modulator. Line 7 fills the values for the Name key in our map with branching logic via the coalesce() step within a different by() modulator.
This example of the coalesce() step can be broken down into the following pseudocode:
```
# pseudocode for
# coalesce(values("tower_name"), values("sensor_name"))
if values("tower_name") is not None:
    return values("tower_name")
else:
    return values("sensor_name")
```
Sensor 1002688 makes for an interesting example in our data because it directly communicates to towers and sensors. Beyond the first neighborhood, however, we can find more paths that connect this sensor to a tower. Let’s use the same query as before to examine the second neighborhood of Sensor 1002688:
```
sensor = dev.V().has("Sensor", "sensor_name", "1002688"). // look up the sensor
         next()                                           // return the sensor vertex
dev.V(sensor).                        // look up a sensor
  out("send").                        // walk to all vertices in the first neighborhood
  out("send").                        // walk to all vertices in the second neighborhood
  project("Label", "Name").           // for each vertex, create a map with 2 keys
  by(label).                          // the value for the first key is the label
  by(coalesce(values("tower_name"),   // if a tower, return tower_name
              values("sensor_name"))) // else return sensor_name
```
{"Label":"Sensor","Name":"1061624"},{"Label":"Sensor","Name":"1307588"},{"Label":"Tower","Name":"WhiteCenter"}
{"Label":"Sensor","Name":"1061624"},{"Label":"Sensor","Name":"1307588"},{"Label":"Tower","Name":"WhiteCenter"}
These results show that the second neighborhood away from Sensor 1002688 discovers another tower, WhiteCenter. Let’s continue walking out and inspect the third neighborhood from Sensor 1002688—see Example 6-5 and Example 6-6.
```
sensor = dev.V().has("Sensor", "sensor_name", "1002688"). // look up the sensor
         next()                                           // return the sensor vertex
dev.V(sensor).                        // look up a sensor
  out("send").                        // walk to all vertices in the first neighborhood
  out("send").                        // walk to all vertices in the second neighborhood
  out("send").                        // walk to all vertices in the third neighborhood
  project("Label", "Name").           // for each vertex, create a map with 2 keys
  by(label).                          // the value for the first key is the label
  by(coalesce(values("tower_name"),   // if a tower, return tower_name
              values("sensor_name"))) // else return sensor_name
```
{"Label":"Sensor","Name":"1064041"},{"Label":"Sensor","Name":"1237824"},{"Label":"Sensor","Name":"1237824"},{"Label":"Sensor","Name":"1002688"//Cycle},{"Label":"Sensor","Name":"1035508"//Cycle}
{"Label":"Sensor","Name":"1064041"},{"Label":"Sensor","Name":"1237824"},{"Label":"Sensor","Name":"1237824"},{"Label":"Sensor","Name":"1002688"//Cycle},{"Label":"Sensor","Name":"1035508"//Cycle}
Figure 6-15 visualizes all of the data from the first three neighborhoods of Sensor 1002688 and highlights the cycles in the data with thick edges.
Figure 6-15 displays the vertices and edges that are within the first, second, and third neighborhoods from Sensor 1002688. Close inspection of the bolded edges in Figure 6-15 finds two cycles:
1035508 → 1307588 → 1035508
1002688 → 1035508 → 1307588 → 1002688
The cycles in our data will be a problem that we will resolve in the next query.
Writing multiple Gremlin statements to hardcode the number of steps you walk away from the starting sensor is not an ideal way to write a query. Instead, we want to start at Sensor 1002688 and explore all communication paths until one of them finds a tower vertex at its root.
We can achieve this with the until().repeat() pattern in Gremlin. The use of repeat() with until() gives you the ability to loop over traversals given some breaking condition. You specify the breaking condition with the until() step. If until() comes before repeat(), it is while/do looping. If until() comes after repeat(), it is do/while looping (see Figure 6-16).
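A minimal sketch of the two orderings, reusing the sensor vertex from the earlier examples:

```
// while/do: the condition is tested before each pass, so a traverser that
// already satisfies it would return immediately
dev.V(sensor).
  until(hasLabel("Tower")).
  repeat(out("send"))

// do/while: walk out("send") once before the first test
dev.V(sensor).
  repeat(out("send")).
  until(hasLabel("Tower"))
```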
Example 6-7 shows how to apply this pattern to the idea from Example 6-5 with the until().repeat() pattern in Gremlin:
```
sensor = dev.V().has("Sensor", "sensor_name", "1002688").
         next()
dev.V(sensor).                 // look up the sensor
  until(hasLabel("Tower")).    // until you reach a tower
  repeat(out("send"))          // keep walking out the send edge
```
The query in Example 6-7 will not finish in a timely manner. This is due to the cycles found as you walk from 1002688 up to any tower.
As we saw in Figure 6-15, we want to remove the cycles from our results. There is a step for this in Gremlin: simplePath().
When it is important that a traverser not repeat its path through the graph, the simplePath() step should be used. The path information of the traverser is analyzed, and if the path has repeated objects in it, the traverser is filtered.
It really is that…simple.
All we have to do is add the simplePath() step within the repeat() step pattern. This will insert a filter that eliminates a traverser if its history contains a cycle. Example 6-8 displays the Gremlin code, and Example 6-9 shows the first three results.
```
1  sensor = dev.V().has("Sensor", "sensor_name", "1002688").
2           next()
3  dev.V(sensor).                 // look up a sensor
4    until(hasLabel("Tower")).    // until you reach a tower
5    repeat(out("send").          // keep walking out the send edge
6      simplePath())              // remove cycles
```
{"id":"dseg:/Tower/Georgetown","label":"Tower","type":"vertex","properties":{}},{"id":"dseg:/Tower/WhiteCenter","label":"Tower","type":"vertex","properties":{}},{"id":"dseg:/Tower/RainierValley","label":"Tower","type":"vertex","properties":{}},...
{"id":"dseg:/Tower/Georgetown","label":"Tower","type":"vertex","properties":{}},{"id":"dseg:/Tower/WhiteCenter","label":"Tower","type":"vertex","properties":{}},{"id":"dseg:/Tower/RainierValley","label":"Tower","type":"vertex","properties":{}},...
The only change from Example 6-7 to Example 6-8 is the use of simplePath on line 6. We can see from Example 6-9 that the first three discovered towers are Georgetown, WhiteCenter, and RainierValley. In our application, we want to know more than just which towers were found. We want to know the path from Sensor 1002688 to the tower.
This brings us to our last Gremlin step and topic for this section: path().
The path() step and manipulating its data structure
Let’s talk about what the path() step in Gremlin does. As you process data in a graph traversal, you are moving around your data. The path() step in Gremlin gives you access to the history of where you have been by providing access to all data that has been processed by a traverser.
The path() step (map) examines and returns the full history of a traverser.
This is roughly like leaving breadcrumbs around your graph as you move from place to place.
We introduce the path() step in Example 6-10 and display the results in Example 6-11.
```
1  sensor = dev.V().has("Sensor", "sensor_name", "1002688").
2           next()
3  dev.V(sensor).
4    until(hasLabel("Tower")).           // until you reach a tower
5    repeat(out("send").                 // keep walking out the send edge
6      simplePath()).                    // remove cycles
7    path().                             // all objects will be towers; get their full history
8    by(coalesce(values("tower_name"),   // if the vertex in the path is a tower
9                values("sensor_name"))) // else the value from a sensor vertex
```
Let’s walk through the new steps of Example 6-10. As before, lines 1 through 6 start at a sensor and walk through the send edges to any tower, considering only noncyclic paths. Then, for all reachable towers, the path() step on line 7 asks each traverser for its full path through the data. Line 8 uses a by() modulator to indicate how we want to see that data: we want to see the tower_name if the vertex is a tower, or else we want to see the sensor_name.
Example 6-11 shows the first three results of Example 6-10. We see two of the paths we drew in Figure 6-15.
{"labels":[[],[]],"objects":["1002688","Georgetown"]},{"labels":[[],[],[]],"objects":["1002688","1035508","WhiteCenter"]},{"labels":[[],[],[],[]],"objects":["1002688","1035508","1061624","1237824","RainierValley"]},...
{"labels":[[],[]],"objects":["1002688","Georgetown"]},{"labels":[[],[],[]],"objects":["1002688","1035508","WhiteCenter"]},{"labels":[[],[],[],[]],"objects":["1002688","1035508","1061624","1237824","RainierValley"]},...
The results in Example 6-11 show three different ways in which you can arrive at towers by starting from Sensor 1002688. The first two paths confirm what we discovered as we walked through the first and second neighborhoods of 1002688; we just see the data in a different structure: ["1002688", "1035508", "WhiteCenter"]. This notation means the following path was found in the traversal:
1002688 → 1035508 → WhiteCenter
More than a thousand different ways to walk from Sensor 1002688 to a tower vertex are shown in the accompanying Studio Notebook.
When you use path() there are two things you must understand deeply: how to assign labels with as() and how to shape the results with by(). Let’s go through each of these topics in detail.
There are two keys to the path() data structure: labels and objects. A label is created for a path object with the as() step. Essentially, you are assigning a variable name to the data you are processing in your path. We didn't use the as() step in the first version of our query, so the labels key in the result payload in Example 6-11 contained no data.
Let’s use the as() step now to assign variable names to our path data structure in Example 6-12, and then we’ll reinspect the resulting payload in Example 6-13.
```
1   sensor = dev.V().has("Sensor", "sensor_name", "1002688").
2            next()
3   dev.V(sensor).
4     as("start").                      // label 1002688 as "start"
5     until(hasLabel("Tower")).
6     repeat(out("send").
7       as("visited").                  // label each vertex on the path as "visited"
8       simplePath()).
9     as("tower").                      // label the end of the path as "tower"
10    path().
11    by(coalesce(values("tower_name"),
12                values("sensor_name")))
```
{"labels":[["start"],["visited","tower"]],"objects":["1002688","Georgetown"]},{"labels":[["start"],["visited"],["visited","tower"]],"objects":["1002688","1035508","WhiteCenter"]},{"labels":[["start"],["visited"],["visited"],["visited","tower"]],"objects":["1002688","1035508","1061624","1237824","RainierValley"]},...
{"labels":[["start"],["visited","tower"]],"objects":["1002688","Georgetown"]},{"labels":[["start"],["visited"],["visited","tower"]],"objects":["1002688","1035508","WhiteCenter"]},{"labels":[["start"],["visited"],["visited"],["visited","tower"]],"objects":["1002688","1035508","1061624","1237824","RainierValley"]},...
Example 6-12 shows how the as() step introduces values within the labels key of the path() data structure. The values within labels and objects have a 1:1 mapping. Let’s look again at the second example from Example 6-13 to understand how the labels map to the path:
```
{
  "labels": [["start"], ["visited"], ["visited", "tower"]],
  "objects": ["1002688", "1035508", "WhiteCenter"]
}
```
值["start"]映射到1002688
The value ["start"] maps to 1002688
值["visited"]映射到1035508
The value ["visited"] maps to 1035508
值["visited", "tower"]映射到WhiteCenter
The value ["visited", "tower"] maps to WhiteCenter
We can confirm this mapping by looking back to our query in Example 6-12. We labeled the starting sensor with as("start"). Each vertex that was accessed within the repeat(out("send")) step was labeled with as("visited"). Last, only towers are passed to line 9 due to the conditional filter from line 5: until(hasLabel("Tower")). Therefore, any tower vertex will receive a second label from line 9 with as("tower").
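The labels are useful for more than documentation inside the path payload; they can also pull out specific waypoints. As a sketch (we have not shaped this against the full sample results), select() retrieves just the two labeled endpoints from Example 6-12's traversal:

```
dev.V(sensor).
  as("start").                      // label the starting sensor
  until(hasLabel("Tower")).
  repeat(out("send").simplePath()).
  as("tower").                      // label the tower that ends the path
  select("start", "tower").         // keep only the two labeled endpoints
  by("sensor_name").                // shape the "start" entry
  by("tower_name")                  // shape the "tower" entry
```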
Using as("<some_label>") is powerful because we are able to use the path() step’s data structure to provide specificity to the resulting payload.
There is one last concept to detail about using path() before we move on to other queries.
path() results with by()
The use of by() in Example 6-12 allows you to perform an operation, or another step, to each object in the path. In our example, we wanted to return the primary key for each vertex in the path. However, the vertex’s label could be a tower or a sensor. Therefore, we added a condition within the by() modulator to process tower vertices one way and sensor vertices another way.
When formatting the elements of path(), the by() modulators in Gremlin are applied in a round-robin fashion, meaning they are applied to the traversal objects in a cyclical order. In a case in which there are two by() steps:
The first by() step operates on the first traversal object.
The second by() step operates on the second traversal object.
Back to the first by() step for the third traversal object.
Back to the second by() step for the fourth traversal object.
And so on…
In the example here, all of the objects in the path were vertices, so we needed to create only one by() modulator to handle vertex objects. You will see examples in the next chapter in which we need multiple by() modulators because we are processing both vertices and edges in our path’s data structure.
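To preview that pattern, here is a hypothetical sketch in which the path alternates vertex, edge, vertex, so two by() modulators are applied round-robin across both element types:

```
dev.V(sensor).
  outE("send").inV().                   // path now holds: vertex, edge, vertex
  path().
  by(coalesce(values("tower_name"),     // 1st, 3rd, ... objects are vertices
              values("sensor_name"))).
  by(label)                             // 2nd, 4th, ... objects are send edges
```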
All of the queries and code in this section were designed to teach you how to walk from leaves to roots in a hierarchical graph. You can think of this as walking from the bottom of your tree to its top.
Once at the top, you may want to walk back down. So let’s next explore how to start at a tower and walk down to the sensors connected to it and the various concepts you will encounter along the way.
The upcoming examples build up to a question that we cannot resolve with the information we have. We designed this experience on purpose to set the stage for the production tips in Chapter 7.
Edge Energy has to maintain an understanding of its network’s topology. It needs to know, at all times, which sensor’s data an individual tower is processing.
Knowing which sensors connect to a particular tower helps answer two important questions about this dynamic communication network. It helps Edge Energy understand whether a specific tower is overloaded or underutilized. We are going to help Edge Energy understand its network by answering the following questions in this section:
1. First, we need to find an interesting tower to explore in our data.
2. Which sensors have directly connected to that tower?
3. From that tower, find all sensors that have connected to it.
Our example data has the ability to answer the questions for these scenarios. However, we do not have enough information to answer the whole scope of question 3—just part of it. Figuring out how to answer question 3 in its entirety will be the purpose of Chapter 7.
Let’s continue to develop our queries and knowledge of the Gremlin query language with our first question.
The first thing we want to do is find a tower that has interesting connectivity in our graph.
Why are we doing this right out of the gate?
When we start playing with new data, we run a couple queries to understand it better. Keep in mind, this is not something we would put in production. This is something we needed to do for educational purposes to find interesting data to work with.
So that we can find an interesting tower to work with, we will want to process all towers in our graph and then order the towers according to the number of incoming edges. Then we want the primary key of the tower with the highest degree. Let’s take a look in Example 6-14 at the Gremlin query that achieves this.
The query in Example 6-14 is for exploration and development purposes only. It is expensive to run in a distributed system because it performs a full table scan of the towers and of each tower’s edge table.
```
1  dev.V().hasLabel("Tower").        // for all towers
2    group("degreeDistribution").    // create a map object
3    by(values("tower_name")).       // the key for the map: tower_name
4    by(inE("send").count()).        // the value for each entry: its degree
5    cap("degreeDistribution").      // barrier step in Gremlin to fill the map
6    order(Scope.local).             // order the entries within the map object
7    by(values, Order.desc)          // sort by values, decreasing
```
In Example 6-14, we construct a map that represents the degree distribution of the tower vertices in our graph. The group() step on line 2 creates a map object called degreeDistribution. We need to follow the group() step with definitions for the map’s keys and values. The by() modulator on line 3 defines that tower_name will be the key for any entry in this map. Line 4 indicates that the value associated to a specific tower_name will be the total number of incoming edges to that tower.
Line 5 introduces a new concept in Gremlin—barrier steps:
Barrier steps force the traversal pipeline to complete up until that point before continuing.
The use of cap() on line 5 in Example 6-14 is an example of a barrier in Gremlin. Here, cap() iterates the traversal up until that step and passes the object with the name degreeDistribution into the next step in the pipeline. We mentioned in the last chapter that local scope orders elements within an object, whereas global scope would order all objects in a traversal pipeline. We see this in action again in line 6; order(Scope.local) orders the elements within the map object degreeDistribution.
Finally, line 7 in Example 6-14 provides the rule for this ordering: we want descending order according to the values in the map. A sample of the results is:
{"Georgetown":"7","WhiteCenter":"7","PioneerSquare":"6","InternationalDistrict":"6","WestLake":"5","RainierValley":"5","HallerLake":"4","SewardPark":"4","BeaconHill":"4",...}
{"Georgetown":"7","WhiteCenter":"7","PioneerSquare":"6","InternationalDistrict":"6","WestLake":"5","RainierValley":"5","HallerLake":"4","SewardPark":"4","BeaconHill":"4",...}
We found a few useful towers, so let’s pick one. We see Georgetown has seven sensors; let’s determine which ones connected directly to that tower.
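If all you need is the single busiest tower rather than the whole distribution, a shorter development-only sketch (again a full scan, so not for production) could order the towers directly:

```
dev.V().hasLabel("Tower").
  order().by(inE("send").count(), Order.desc). // most incoming send edges first
  limit(1).
  values("tower_name")
```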
We’ll start by querying a tower and the sensors that have directly connected to it. We can follow the same pattern we did in Example 6-12 when we were working from sensors:
```
sensor = dev.V().has("Sensor", "sensor_name", "1002688").next()
dev.V(sensor).
  out("send").
  project("Label", "Name").
  by(label).
  by(coalesce(values("tower_name"), values("sensor_name")))
```
This time, we want to start at a tower and access its incoming communication from sensors. We can change the query in Example 6-12 to the query shown in Example 6-15. The results of this query follow.
```
tower = dev.V().has("Tower", "tower_name", "Georgetown").next() // get Georgetown
dev.V(tower).                // look up Georgetown
  in("send").                // traverse in to sensors
  project("Label", "Name").  // create a map with two keys
  by(label).                 // the values for "Label"
  by(values("sensor_name"))  // the values for "Name"
```
{"Label":"Sensor","Name":"1002688"},{"Label":"Sensor","Name":"1027840"},{"Label":"Sensor","Name":"1306931"},...
{"Label":"Sensor","Name":"1002688"},{"Label":"Sensor","Name":"1027840"},{"Label":"Sensor","Name":"1306931"},...
The results of Example 6-15 show that Sensor 1002688 connected to Georgetown, a result we expected to see. Even though we didn’t show all seven in this text, the full results in the Studio Notebook show that Georgetown has seven sensors that directly connected to it.
Edge Energy needs to know all sensors that use this tower for communication. We already know that Sensor 1002688 has an incoming edge from 1307588. This leads us to ask, how many other sensors are using the network to send their information to Georgetown?
To answer that question, we will want to walk recursively from this tower through all incoming edges until we have found all sensors in this tree of communication. The next and last section of this chapter applies the use of repeat()/until() from the last section to walk from this tower down to all sensors.
We have been working through queries on our data for a few sections now. This last question is the final query needed to answer our larger, complex problem: what happens if a tower fails?
The logical way to approach this last query won’t actually work, but we are going to show it to you anyway because it is the logical next step that everyone tries; we see it all the time. We encourage learning through trying logical next steps, so that is exactly what we are about to do.
It is very common to find patterns of working Gremlin queries and apply them to new problems; this is the pattern we are talking about that will lead to a faulty solution to our new question.
Let’s take a look back at how we recursively walked up from sensors to towers:
```
dev.V(sensor).               // look up a sensor
  until(hasLabel("Tower")).  // until you reach a tower
  repeat(out("send").        // keep walking out the send edge
    simplePath())            // remove cycles
```
The logical next step is to transform that query to do exactly the reverse: walk from towers to sensors. Let's apply the same pattern but switch the type of objects we start and end with. In Example 6-16, we start with towers and recursively walk to sensors.
```
tower = dev.V().has("Tower", "tower_name", "Georgetown").next() // get Georgetown
dev.V(tower).                 // look up a tower
  until(hasLabel("Sensor")).  // until you reach a sensor
  repeat(__.in("send").       // need to use the Anonymous traversal: __.
    simplePath())             // remove cycles
```
Example 6-16 requires a new step in Gremlin called the Anonymous traversal. In Groovy, in() is a reserved keyword, and DataStax Studio uses the Groovy variant of Gremlin for developing traversals. Therefore, the in() Gremlin step must be prefixed with the Anonymous traversal for our example. The full result payload is shown in Example 6-17.
The Anonymous traversal __. is used across the many variants of Gremlin to resolve clashes with reserved language-specific keywords such as in, as, or values. Refer to the Apache TinkerPop documentation for the specifics of your coding language of choice.
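A small sketch of where the prefix matters: mid-traversal in() is an ordinary method call on the traversal object, but at the head of an anonymous traversal the Groovy keyword must be escaped:

```
dev.V(tower).in("send")   // fine: in() here is a method call on the traversal
repeat(__.in("send"))     // fine: the Anonymous traversal stands in for the start
// repeat(in("send"))     // fails in Gremlin-Groovy: "in" is a reserved keyword
```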
{"id":"dseg:/Sensor/1002688","label":"Sensor","type":"vertex","properties":{}},{"id":"dseg:/Sensor/1027840","label":"Sensor","type":"vertex","properties":{}},{"id":"dseg:/Sensor/1306931","label":"Sensor","type":"vertex","properties":{}}...
{"id":"dseg:/Sensor/1002688","label":"Sensor","type":"vertex","properties":{}},{"id":"dseg:/Sensor/1027840","label":"Sensor","type":"vertex","properties":{}},{"id":"dseg:/Sensor/1306931","label":"Sensor","type":"vertex","properties":{}}...
Wait a second. Inspecting the full result payload in Example 6-17 reveals that the query from Example 6-16 found only the same seven sensors from the tower’s first neighborhood.
This is not what we want.
The query in Example 6-16 does not give us what we want because it has a stopping condition of any sensor vertex on line 2 with until(hasLabel("Sensor")). Instead, we want to recursively walk any depth until we find all sensors. Let’s remove this condition and try again:
```
tower = dev.V().has("Tower", "tower_name", "Georgetown").next() // get Georgetown
dev.V(tower).            // look up a tower
  repeat(__.in("send").  // keep walking in the send edges
    simplePath())        // remove cycles
```
If you ran this second version of our query in DataStax Studio, you most likely saw the error in Table 6-5:
| System error |
|---|
| Request evaluation exceeded the configured threshold of realtime_evaluation_timeout at 30000 ms for the request |
At the heart of this error is the trouble of recursively walking through trees in a graph. We were starting at the root of a tree and completing a full search down to all the leaves in the tree.
This is extremely expensive.
There are many ways to address the error in Table 6-5. One way is to limit how deep a traverser travels away from its starting point.
You control the number of times a traverser executes a loop with the times(x) step. The pattern repeat(<traversal>).times(x) is one of the most popular ways to limit the depth of a recursive traversal in Gremlin. In this pattern, the value x tells a traverser to perform the repeat loop x number of times.
In the following query, we show repeat(<traversal>).times(3). This means that from a tower, a traversal walks out only three in() edges and then stops:
```
tower = dev.V().has("Tower", "tower_name", "Georgetown").next() // Georgetown
dev.V(tower).                        // look up Georgetown
  repeat(__.in("send").              // repeat walking in the send edges
    simplePath()).                   // remove cycles
  times(3).                          // repeat only 3 times total
  path().                            // get the path
  by(coalesce(values("tower_name"),  // if a tower, return tower_name
              values("sensor_name")))// else, return the sensor name
```
The results are:
{"labels":[[],[],[],[]],"objects":["Georgetown","1235466","1257118","1201412"]},{"labels":[[],[],[],[]],"objects":["Georgetown","1290383","1027840","1055155"]},{"labels":[[],[],[],[]],"objects":["Georgetown","1235466","1059089","1255230"]},...
{"labels":[[],[],[],[]],"objects":["Georgetown","1235466","1257118","1201412"]},{"labels":[[],[],[],[]],"objects":["Georgetown","1290383","1027840","1055155"]},{"labels":[[],[],[],[]],"objects":["Georgetown","1235466","1059089","1255230"]},...
The benefit of depth limiting with our example data is that we can now perform part of our final query for this chapter. However, we reduced the scope of the question from finding all sensors to only those sensors within a specific depth, namely 3. There are many more reachable sensors that we are missing by limiting depth.
To find them all, we need to revisit time in our example data.
We realize we left you hanging with that last query.
We set up the need to walk from a root (a tower) down to all leaves (sensors), and it didn’t work as expected. All that is to say, your journey as a data engineer for Edge Energy isn’t over quite yet. We will keep using Edge Energy’s example as we transition into the next chapter where we explain how to adjust our setup to answer our question.
Let’s travel deeper into the structure of our trees to find the branches that get us out of this forest of problems. We are going to teach you how to prune the data you process in your query by limiting your branching factor, limiting by depth, and removing cycles. And if your eyes aren’t rolling at the terrible puns by now, just wait.
Whether you are modeling corporate structures or unbounded networks of IoT sensor communication, hierarchical data fits very well into graph technologies.
As we see it, especially with unbounded and hierarchical data, the mental distance between the data on disk and using it is much shorter when you use graph technology. However, as we saw at the end of the previous chapter, simple questions with expressive languages and natural models can open the door to unexpected behavior.
Namely, it is easy to think about starting at the root of a tree and walking all the way down to its leaves. And graph technologies enable the code for this to be quite simple.
However, the simplicity that comes with reasoning about complex, tree-structured problems obfuscates the complexity of processing the data’s natural hierarchical structure.
This chapter will have four main sections. Each section builds upon the previous one to walk through modeling the time property on edges to resolve our error at the end of Chapter 6.
In the first section, we’ll build upon the data introduced in the last chapter by adding two complexities: time on edges and valid paths. The second section delves into why a valid communication tree reduces the amount of data to process. We will update and walk through the production version of our graph schema in this section. The third and fourth sections of this chapter revisit the same set of queries from the last chapter. This time, however, we will apply our knowledge of valid trees and the new production schema to significantly reduce the amount of data processed in each query.
At the end of this chapter, you will have everything you need to start working with trees in your own data. We consider the content in Chapters 6 and 7 to contain a streamlined yet complete introduction to working with hierarchical structured data in a production application with graph technologies.
To help get you there, let’s go back to the data we created for this example and follow the edges throughout time.
The data we created and introduced in “Understanding Hierarchies with Our Sensor Data” simulates how sensors send data to each other and to cell towers. We introduced this data within the context of a power company, Edge Energy. The data engineers at Edge Energy have to build a system capable of reporting sensor coverage in the event of a tower failure.
This brings us to the concept of time in our data. The sensors collect and send data throughout the network at specific time intervals. This means that the number of vertices in our graph will be fixed, and it is the relationships in the graph that grow over time.
We model the dynamic communication over time intervals with a timestep property on the edges. Let’s look at our data in Figure 7-1 to see how time is part of the communication network.
The only difference between the examples in Chapter 6 and Figure 7-1 is the inclusion of time on the edges.
Consider the Seattle tower at the bottom of Figure 7-1. Sensor S, to the lower right of Seattle, has an edge with the values [0,3]. In our application, this means that the sensor sent information to the Seattle tower at timestep 0 and timestep 3. In other words, this sensor directly connected with the Seattle tower twice. You can also see that Sensor S connected with a nearby neighbor at timesteps 1, 2, 4, and 5.
To understand how to use time on our edges in our upcoming queries, we need to introduce four topics. These four topics will be the next four sections:
Understanding time from the bottom up
Valid paths from the bottom up
Understanding time from the top down
Valid paths from the top down
Let’s start with showing you how to walk through the data from sensors up to towers.
To help you understand how to use time, consider it in context. Recalling our setup from Chapter 6, the first queries walk from the leaves up to the root. This is a walk from the sensors to the tower that received their data.
Every walk from a sensor to a tower is no longer a valid walk because we have to consider the timing of the communication along the way. That is, a message passed from a sensor at timestep 3 will be passed along from its recipient at timestep 4. Let’s look at an example; we’ll start by zooming in on valid walks from Sensor S to nearby towers in Figure 7-2.
The examples in Figure 7-2 show five valid paths from Sensor S to towers. Let’s walk through two scenarios and then show you where to start to find the other three.
The first valid path is to follow the first message sent by Sensor S at timestep 0. This walk goes from Sensor S — 0 → Seattle. This one is pretty easy.
For a more complicated example, let’s follow the second communication path that leaves from Sensor S. The second path starts at timestep 1. Here we find a much deeper path. This walk goes from:
Sensor S — 1 → A
A — 2 → B
B — 3 → C
C — 4 → D
D — 5 → FirstHill
When we are following these paths, we walk through time by incrementing by one along the way. You can keep walking through the other valid paths from Figure 7-2. The third valid path starts at timestep 2, the fourth valid path starts at timestep 3, and the last one starts at timestep 4.
In Chapter 6 we saw seven paths from Sensor S, but now we know that two of them weren’t possible!
We also can flatten the data and examine these paths in a more hierarchical form. Let’s look at the hierarchy from Sensor S to nearby towers in Figure 7-3.
Figure 7-3 shows the same data as Figure 7-2 but in a more hierarchical form. It may be easier to see the four unique paths of Figure 7-3 by counting the edges that go into the tower vertices. Here, we can also see the other paths we didn’t cover from Figure 7-2. We see the path of:
Sensor S — 2 → A
A — 3 → FirstHill
Whichever mental model you prefer, we are digging into how these trees work for a few reasons. First and most important, the dynamic nature of connections and communication between the vertices in this dataset represents a real-world scenario for how devices in the field transmit their data back to databases.
Second, the use of time on edges sets us up to understand what type of communication tree would be observable in the real world. Let’s take a look at valid and invalid paths throughout time when walking up from sensors to towers.
Let’s start by thinking about how to correctly interpret time when you follow it from a sensor to a tower.
Conceptually, you can think of a valid path from a sensor to a tower as passing the data on to the next sensor in order. In the data, a valid path increments time by one as you walk through the edges.
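Chapter 7's production model will make this rule queryable. Purely as a sketch of the idea (limited to paths of exactly two hops, assuming a schema with timestep on the edges as in Figure 7-1, and untested against the sample data), the rule can be expressed by comparing consecutive edges:

```
// hypothetical two-hop check: the second edge's timestep must be exactly
// one greater than the first edge's timestep
sensor = dev.V().has("Sensor", "sensor_name", "1002688").next()
dev.V(sensor).
  outE("send").as("e1").inV().                        // first hop, remember the edge
  outE("send").as("e2").inV().hasLabel("Tower").      // second hop must land on a tower
  where("e2", eq("e1")).                              // compare the two edges:
    by(values("timestep")).                           //   e2's timestep ...
    by(union(values("timestep"), constant(1)).sum()). //   ... must equal e1's + 1
  path()
```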
Continuing on conceptually, an invalid path is when you try to pass information to another sensor out of order. This is like missing your train: you got there either too late or too early.
To put this into practice, let’s explore examples of valid and invalid paths. First, Figure 7-4 shows an invalid path; it’s invalid because the sensor’s data was received late.
Figure 7-4 shows two paths from Sensor A to the Seattle tower. The path on the left is valid because time on the edges correctly increments by one along the way. The path on the right is invalid because each exchange with a sensor is out of order. Sensor A sends its information to Sensor B after Sensor B has communicated with Sensor C. The same problem happens with every exchange along the path on the right in Figure 7-4.
Let’s take a look at another type of invalid path in Figure 7-5; the paths on the right of Figure 7-5 show instances of sensors communicating too early.
Figure 7-5 shows paths from Sensor D and Sensor A to the Seattle tower. On the left, the paths are valid. Sensor D sends its data to Sensor C at timestep 3, as does Sensor A. Then Sensor C collects all the data and sends it on to the Seattle tower at timestep 4.
The paths on the right in Figure 7-5 are invalid. Sensor D sends its data to Sensor C at timestep 0; the data for Sensor D leaves Sensor C at timestep 1 (not shown). Sensor A sends its data to Sensor C at timestep 1; the data for Sensor A leaves Sensor C at timestep 2 (not shown). Figure 7-5 shows that Sensor C communicated with the Seattle tower at timestep 3. This means that the data for D and A was not a part of that communication because it was passed along different paths at timestep 1 and timestep 2, respectively.
That covers everything we need to know about our data when we walk from leaves to roots. Let’s look at how we apply time in the reverse direction.
The last concept for our example applies time as we walk from towers down to all sensors. These paths represent how we would figure out which sensors connected to a tower at a certain time.
The key here is that a valid path from a tower down to a sensor has to follow time in decreasing order, exactly by 1.
To see this, let’s zoom in and examine the network that sends its information to the WestLake tower in Figure 7-6. Figure 7-6 is dense with information. To understand what it is showing, we recommend starting with what you know how to trace by following valid paths from sensors up to the tower. Starting this way makes it much easier to accomplish our ultimate goal: walking in reverse from WestLake down to sensors.
Let’s start with Sensor M, at lower right in Figure 7-6. We want to follow the valid path from the sensor up to WestLake:
Sensor M — 2 → I
I — 3 → F
F — 4 → WestLake
The goal is to be able to see this in reverse from WestLake back to Sensor M. So trace that same path but in the opposite direction:
WestLake – 4 → F
F – 3 → I
I – 2 → Sensor M
Let’s unroll all valid paths that arrive at WestLake at timestep 4. The hierarchy from the root, WestLake, down to all sensors that connected to it is shown in Figure 7-7. All paths from Figure 7-7 can be found in Figure 7-6. We just untangled their representation on the map to look at their hierarchical structure.
It is easier to walk backwards from towers to sensors in the hierarchical structure shown in Figure 7-7. For example, follow the path backwards from the WestLake tower to Sensor M in this image. The path is the same as before, but it is easier to see how time decreases in this image.
We find it easier to see the paths when you look at the hierarchical structure of the data as shown in Figure 7-7. But you may prefer to follow them according to their geo-location, like in Figure 7-6.
As long as you can see how to decrease time as we walk from a tower back down to sensors, then we have achieved our goal.
Before we can update our production schema, we have one last concept to understand: valid paths from roots to leaves.
Think about what we are doing when we find a valid path from a tower down to all sensors. We have reversed the process we used for walking up from sensors to a tower.
In this reversal, we walked backwards in time. Specifically, we decreased the timestep values on the edges by one along the way.
Let’s look at another side-by-side example of valid and invalid paths. This time, however, we are considering the perspective from the tower down to the sensors, back through time. The communication path on the right in Figure 7-8 is invalid because communication was too late or too early along the path.
Figure 7-8 shows paths from the Seattle tower to Sensor A. On the left, the path is valid:
Seattle – 4 → D
D – 3 → C
C – 2 → B
B – 1 → A
Contrast the path on the left with its invalid representation on the right. The path on the right is invalid because of the time at which Sensor D received its information:
西雅图 – 3 → D(太晚了,无法进行下一次转机)
D – 4 → C(对于下一次连接来说太早)
C–2→B
B–1→ASeattle – 3 → D (too late for the next connection)
D – 4 → C (too early for the next connection)
C – 2 → B
B – 1 → A
Seattle sent its data to Sensor D too late; Sensor D had already passed its data to Sensor C. The communication path between D and C is also invalid.
It is much harder to reason about time when going backwards. The trick here is that a valid path from a tower down to a sensor has to follow our timestep property in decreasing order, exactly by 1.
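To make the rule concrete, here is a minimal sketch in Groovy, the language of this book's examples; the helper name isValidDownwardPath is ours, not part of the chapter's codebase:

// hedged sketch: given the timestep values collected along a walk from
// a tower down to a sensor, confirm each hop decreases by exactly 1
def isValidDownwardPath(steps) {
    (1..<steps.size()).every { i -> steps[i] == steps[i - 1] - 1 }
}

assert isValidDownwardPath([4, 3, 2])   // e.g., WestLake – 4 → F – 3 → I – 2 → Sensor M
assert !isValidDownwardPath([3, 4, 2])  // time increases mid-walk: invalid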
Understanding time in this dataset is easy for modeling: we added it to our edges. We will see the production schema in a coming section.
The detail and difficulty come in when we want to use time in our queries. Aside from all of the images and detail in this chapter, using time in this example boils down to the following tip:
Time goes up on the way up, and time goes down on the way down. When this rule of thumb isn’t true, the path is invalid and should be filtered out of the results.
Now that we know how to use time on our edges, let’s explain why this resolves our error from Chapter 6. We want to limit our results to valid paths, and therefore, we are mitigating our graph’s branching factor.
Let’s explain what branching factor is and why you need to know about it for this example and others.
We ended Chapter 6 with a problem. We were unable to walk from a tower down to all sensors that connected to it because of the data’s branching factor.
Let’s dig into the details of this concept and illustrate the processing complexity that comes with walking through highly branching data.
Branching factor is what happens when you walk from one vertex through relationships to many other vertices. Formally, we define branching factor as follows:
A graph’s branching factor (BF) is the expected, or average, number of edges for any vertex.
You can think of this as splitting one process, or one traverser, into many. We illustrate this in Figure 7-9.
In Figure 7-9, the WestLake tower vertex has seven edges adjacent to seven unique vertices. We say the WestLake tower has a branching factor of 7.
Your data’s branching factor affects traversal performance. For example, a traversal that starts at the WestLake tower creates one traverser in the pipeline. When you walk from the WestLake tower to all incoming vertices, the single traverser splits for every possible edge. We end up with seven total traversers on the sensor vertices, shown at the bottom of Figure 7-9.
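As a quick sanity check, you could measure this directly in Gremlin. Here is a hedged sketch; the traversals are ours, not from the chapter's notebooks:

// count the edges adjacent to the WestLake tower: its branching factor
tower = g.V().has("Tower", "tower_name", "WestLake").next()
g.V(tower).bothE("send").count()            // expect 7 for this example

// estimate the whole graph's branching factor as its average degree
g.V().local(bothE("send").count()).mean()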
The processing overhead for a traversal correlates to a graph's branching factor. Roughly, the number of traversers maps to the number of threads required to execute a traversal. You can calculate the number of threads required to process a query with the equation shown in Figure 7-10.
That sounds great and all, but why should you care?
Let’s say your graph’s expected branching factor is 3. Starting at a single vertex, you have 1 traverser. Walking one neighborhood away creates 3 traversers. Two neighborhoods away creates 9; three neighborhoods away creates 27. When you are four neighborhoods away, you are processing 81 traversers, just for that level. The total number of traversers you have created is 1 + 3 + 9 + 27 + 81 = 121.
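Figure 7-10's equation is not reproduced here, but the arithmetic above is a geometric series, so a hedged reconstruction of the total work is:

$$\text{total traversers} \;=\; \sum_{i=0}^{n} BF^{\,i} \;=\; \frac{BF^{\,n+1} - 1}{BF - 1}$$

where $n$ is the number of neighborhoods explored and $BF$ is the branching factor. For $BF = 3$ and $n = 4$, this gives $(3^5 - 1)/2 = 121$, matching the count above.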
The exponential growth can quickly get out of hand. Figure 7-11 shows just how quickly.
The message from Figure 7-11 is that a graph’s branching factor yields exponential growth on the amount of data that you have to process as you explore multiple neighborhoods of data. Loosely speaking, you can equate one Gremlin traverser to one thread in your computer. This means that the number of threads required to explore your data grows exponentially.
The beauty of working with graph data in Apache Cassandra is that we already have all the tools necessary to tame your data’s branching factor. A primary way to mitigate a query’s branching factor goes back to how you store your data on disk.
One of the best tips we can offer is to use properties on edges to give yourself a way to navigate your data’s branching factor during queries.
Cluster your edges on disk so that you can sort through them in your queries and mitigate the effect of your data’s branching factor.
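For instance, with timestep as a descending clustering key, an edge scan comes back already ordered by time. A hedged sketch of what that buys you (our own example, not from the notebooks):

// read a tower's incoming send edges; because timestep is a clustering
// key stored in descending order, no explicit order() step is needed
g.V(tower).inE("send").values("timestep")   // e.g., 4, 4, 3, 2, 1, ...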
Let’s apply our understanding of branching factor. We want to update our development schema so that our production queries are less affected by our tree’s branching factor.
Our new understanding of time on edges and our exploration in development give us two optimizations for our production schema. First, our new understanding of time on edges, valid paths, and branching factor indicates why we need to cluster our edges by time. Second, our queries in Chapter 6 illustrated that we will be traversing the send edge in both directions. Therefore, our second change will be to add a materialized view on the send edge labels for bidirectional usage in traversals.
Figure 7-12 illustrates a production version of our conceptual data model with these changes.
In Figure 7-12, the use of materialized views on each send edge is indicated by a dotted line going in the reverse direction. We also see that our edges will be clustered by time in decreasing order, as indicated by the timestep (CK↓) notation.
Applying the Graph Schema Language (GSL), we cluster our edges by time with:
schema.edgeLabel("send").ifNotExists().from("Sensor").to("Sensor").clusterBy("timestep",Int,Desc).create()
schema.edgeLabel("send").ifNotExists().from("Sensor").to("Sensor").clusterBy("timestep",Int,Desc).create()
schema.edgeLabel("send").ifNotExists().from("Sensor").to("Tower").clusterBy("timestep",Int,Desc).create()
schema.edgeLabel("send").ifNotExists().from("Sensor").to("Tower").clusterBy("timestep",Int,Desc).create()
We create these indexes in our schema code with:
schema.edgeLabel("send").from("Sensor").to("Sensor").materializedView("sensor_sensor_inv").ifNotExists().inverse().create()schema.edgeLabel("send").from("Sensor").to("Tower").materializedView("sensor_tower_inv").ifNotExists().inverse().create()
schema.edgeLabel("send").from("Sensor").to("Sensor").materializedView("sensor_sensor_inv").ifNotExists().inverse().create()schema.edgeLabel("send").from("Sensor").to("Tower").materializedView("sensor_tower_inv").ifNotExists().inverse().create()
The edge label syntax in the preceding code creates materialized views for each respective send edge label. By using the inverse() convenience method, we are applying the same order to the edges in the reverse direction. This means that the edges will have a clustering key of timestep in the reverse direction.
To reinforce traversal-driven modeling, you want your production edge labels to be in the direction that you will most commonly traverse and the materialized views to be in the less common direction.
There are no changes from Chapter 6 to the provided data or to how we load it for our example. However, recall the first five lines of Sensor_send_Sensor.csv, shown in Table 7-1.
| Out sensor name | Timestep | In sensor name |
|---|---|---|
| 103318117 | 1 | 126951211 |
| 1064041 | 2 | 1307588 |
| 1035508 | 2 | 1307588 |
| 1282094 | 1 | 1031441 |
In Chapter 6, our schema did not have a timestep on the send edge labels. Therefore, our loading process omitted the timestamps on the edge data.
However, our schema for this chapter uses timestep to cluster our edges. Therefore, when we load the exact same data with the same process, we will have edges with time on them. To see the code, please head to the data directory within this book’s GitHub repository for the data and loading scripts for this chapter.
Let’s apply our understanding of time, valid paths, and branching factor to refactor our queries from Chapter 6.
We want to ask the same questions as before, but now we want to use time on the edges to consider only valid paths. Let's start with our first query and see when a given sensor communicated data to another sensor or tower.
This is the same question that we started with before, but we are using a different sensor this time: 104115939. We want to add the timestep property into the map of results. This requires using the edge in our traversal and adding an additional element to our map. Let’s look at the query in Example 7-1 and then at the example results. Then we will walk through the code below.
1  sensor = g.V().has("Sensor", "sensor_name", "104115939").next()
2  g.V(sensor).                              // look up the sensor
3    outE("send").                           // walk out and stop on all edges
4    project("Label", "Name", "Time").       // create a map for each edge
5      by(__.inV().                          // traverse in
6         label()).                          // values for the first key
7      by(__.inV().                          // traverse in
8         coalesce(values("tower_name"),     // values for the 2nd key if a tower
9                  values("sensor_name"))).  // otherwise return sensor_name
10     by(values("timestep"))                // values for the 3rd key: "Time"
And the results are:
{"Label":"Sensor","Name":"104115918","Time":"1"},{"Label":"Sensor","Name":"10330844","Time":"0"}
{"Label":"Sensor","Name":"104115918","Time":"1"},{"Label":"Sensor","Name":"10330844","Time":"0"}
In Example 7-1, the query sets up as we have seen before. We create a traversal and populate the traversal pipeline on line 2 with a single vertex. On line 3, we move to all outgoing edges from the sensor. Line 4 uses project to create a map object with three keys: Label, Name, and Time. The values in the map for Label will be filled with the traversal from line 5: the label of the incoming vertex on the other side of the edge. The values in the map for Name will be filled with the try/catch pattern of the coalesce step on line 7: either the name of a tower or the name of a sensor. Last, the values in the map for the key Time will be filled with the traversal from line 10: accessing the property value timestep from the edge.
Let’s use the pattern from Example 7-1 and follow any path to a tower, but we want to also look at the timestep values along the way.
The next query is the same one we set up in Chapter 6, but we are adding the timestep property from the edge into the result payload. From here, we will be able to understand which paths are valid and which are invalid. Let’s look at the query in Example 7-2. We will delve into the details afterward.
1  sensor = g.V().has("Sensor", "sensor_name", "104115939").next()
2  g.V(sensor).                        // look up a sensor
3    as("start").                      // label it "start"
4    until(hasLabel("Tower")).         // until we reach a tower
5    repeat(outE("send").              // walk out and stop on the send edge
6           as("send_edge").           // label it "send_edge"
7           inV().                     // walk into the adjacent vertex
8           as("visited").             // label it "visited"
9           simplePath()).             // remove cycles
10   as("tower").                      // label it "tower"
11   path().                           // get path of vertices and edges from "start" to "tower"
12     by(coalesce(values("tower_name",    // 1st object in the path is a vertex
13                        "sensor_name"))).
14     by(values("timestep"))          // 2nd object in the path is an edge
Let’s walk through the code from Example 7-2 before we show the results. Line 2 populates the traversal pipeline with a single vertex. The use of until()/repeat() on lines 4 and 5 uses the while/do pattern in Gremlin. Line 5 ensures each traverser in the pipeline accesses the send edge and labels it as send_edge so that we can reference it in the path object. Line 8 labels any vertex along the way as a visited vertex, while line 10 adds the label tower to the last vertex in the path. The last vertex on this walk will always be a tower due to the stopping condition in line 4.
The trickiest part of Example 7-2 occurs from lines 11 through 14. Here, we apply by() modulators in round-robin order to mutate the objects in the path structure so as to populate our query’s results with meaningful information about each path.
Let’s break this down.
Line 11 from Example 7-2 asks each traverser to populate its path object into the traversal pipeline. Every path will follow the structure [Start, Edge, Vertex, … , Edge, Tower]. This is true because we started at a sensor and then repeatedly accessed an edge and its adjacent vertex.
We use this pattern with the by() modulators on lines 12 and 14. The by() modulator on line 12 will map to the even-numbered objects in the path object [0, 2, 4, … ]. The objects at even-numbered positions in the path object are guaranteed to be vertices. For any vertex, we want to mutate the object in the path to include only the vertex’s tower_name or its sensor_name; we use the try/catch pattern of the coalesce() step to do this.
On line 14, the by() modulator will map to the odd-numbered objects in the path object, [1, 3, 5, … ]. The odd-numbered objects in the path are guaranteed to be edges. We want the path object to show the timestep from a particular edge; we use values("timestep") to do this mutation.
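To see the round-robin mapping at a glance, here is a hedged sketch of the alternating path structure; the annotations are ours:

// position 0, 2, 4, ... -> vertices -> 1st by(): coalesce(tower_name, sensor_name)
// position 1, 3, 5, ... -> edges    -> 2nd by(): values("timestep")
// example result: ["104115939", "0", "10330844", "1", ..., "5", "Bellevue"]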
In Example 7-3, we show two results from the query in Example 7-2. These results show the labels payload from the path object so that you can map each of the as() labels from the query to the path object. For space reasons, the results in Example 7-3 are the only time we will be showing the labels payload; this section of the results will be omitted throughout the rest of our examples.
...,{"labels":[["start"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited","tower"],],"objects":["104115939","0","10330844","1","126951211","2","127620712","3","103318129","4","103318117","5","Bellevue"]},{"labels":[["start"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited","tower"],],"objects":["104115939","0","10330844","1","126951211","2","127620712","3","103318129","0","103318117","5","Bellevue"]},...
...,{"labels":[["start"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited","tower"],],"objects":["104115939","0","10330844","1","126951211","2","127620712","3","103318129","4","103318117","5","Bellevue"]},{"labels":[["start"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited"],["send_edge"],["visited","tower"],],"objects":["104115939","0","10330844","1","126951211","2","127620712","3","103318129","0","103318117","5","Bellevue"]},...
The first result in Example 7-3 is a valid path because time correctly follows in sequence: 0,1,2,3,4,5. The second result is an invalid path because the sequence of time on the edges is out of order: 0,1,2,3,0,5. The second result is an example of the communication path breaking after timestep 3.
The two paths shown in Example 7-3 point to the details of valid and invalid trees. The first path is valid because it follows time sequentially, whereas the second path does not. We have visualized the resulting paths in Figure 7-13 to see how one is valid and the other is invalid.
The top path in Figure 7-13 is valid because it follows an incremental pattern from start to finish. The bottom path in Figure 7-13 is broken because Sensor 103318129 receives its data at timestep 3, but the next edge out of 103318129 occurs at an earlier time, timestep 0.
We need to consider only valid trees as we walk from a sensor up to a tower. Monitoring the value of timestep as we walk through our data is the final example for this section.
We want to use the pattern from Example 7-2, but we want to check the value on the send edges as we walk through the data. The idea is essentially to accomplish what you see in Example 7-4, but without hardcoding the timestep values.
1  sensor = g.V().has("Sensor", "sensor_name", "104115939").next()
2  g.V(sensor).                              // look up a sensor
3    outE("send").has("timestep", 0).inV().  // traverse edges with timestep = 0
4    outE("send").has("timestep", 1).inV().  // traverse edges with timestep = 1
5    outE("send").has("timestep", 2).inV().  // traverse edges with timestep = 2
6    outE("send").has("timestep", 3).inV().  // traverse edges with timestep = 3
7    outE("send").has("timestep", 4).inV().  // traverse edges with timestep = 4
8    outE("send").has("timestep", 5).inV().  // traverse edges with timestep = 5
9    path().                                 // get the path from the sensor
10     by(coalesce(values("tower_name",      // for the even position elements
11                        "sensor_name"))).  // get the vertex's ID
12     by(values("timestep"))                // for the odd position elements
The query in Example 7-4 works only if we already know how deep the tree is. For any given sensor we won't know that, so we need a counter variable that starts at 0 and increments by one until we find a tower.
Gremlin has a step for this: loops(). The loops() step keeps track of the number of times a repeat is executed; loops() starts at zero and will increment by one for every iteration of the repeat step.
The loops() step extracts the number of times the traverser has gone through the current loop.
We can use the counter from loops() and compare it to the value of an edge’s timestep. Comparing the counter to an edge’s timestep will give us the ability to consider only valid trees from our starting sensor to a tower.
Let’s use loops() and create a filter on an edge. We want an edge to pass through the filter when its timestep is equal to the loops() variable. We want an edge to fail to pass through the filter if its timestep is not equal to the loops() variable. While this requirement seems rather contrived, it is very common to walk edges in a sequential fashion. The overarching problem and solution provide context and transferable solutions to a common application pattern.
Example 7-5 shows how to use loops() and create a filter on an edge in Gremlin.
1  sensor = g.V().has("Sensor", "sensor_name", "104115939").next()
2  g.V(sensor).as("start").             // look up a sensor, label it
3    until(hasLabel("Tower")).          // until you reach a tower
4    repeat(outE("send").               // traverse out to a send edge
5           as("send_edge").            // label it "send_edge"
6           where(eq("send_edge")).     // filter: an equality test
7             by(loops()).              // an edge passes if loops() is equal to
8             by("timestep").           // the timestep on the edge
9           inV().                      // walk to adjacent vertex
10          as("visited")).             // label it "visited"
11   as("tower").                       // guaranteed tower; label it "tower"
12   path().                            // path from "start" to "tower"
13     by(coalesce(values("tower_name",     // for the even position elements
14                        "sensor_name"))). // get vertex's ID based on its label
15     by(values("timestep"))           // for the odd position elements: time
And here are the results of Example 7-5; we omitted the labels payload from the path() object:
{...,"objects":["104115939","0","10330844","1","126951211","2","127620712","3","103318129","4","103318117","5","Bellevue"]}
{...,"objects":["104115939","0","10330844","1","126951211","2","127620712","3","103318129","4","103318117","5","Bellevue"]}
Let's walk through the steps in Example 7-5. Line 2 fills the traversal pipeline with a starting vertex. Lines 3 through 9 set up our recursive walk from the sensor to any tower by accessing outgoing edges and then incoming vertices. Lines 6, 7, and 8 define a filter for an edge: a traverser passes through this filter if the edge's timestep is equal to the loop counter, and it is removed otherwise.
The only traverser that will pass through this recursive loop and the filter will be the traverser that forms a valid walk from the starting sensor to the tower. We use the same pattern to format the path results and confirm that we found the only valid walk from sensor 104115939 up to the Bellevue tower.
Using the where().by() pattern in Example 7-5 was probably a surprise to you.
We would like to show you a common way people try to solve this problem and then explain why it doesn’t work, to help you understand a deeper topic from the Gremlin query language.
Most people would start by using has("timestep", loops()) as a filter on the edges. We will take a look at using it in Example 7-6 and then we will explain why it is wrong.
The query in Example 7-6 doesn’t accurately answer the question for this chapter. It is included for educational purposes.
1  g.V(sensor).
2    until(hasLabel("Tower")).
3    repeat(outE("send").as("send_edge").
4           has("timestep", loops()).   // this does not work; details in text
5           inV().as("visited")).
6    as("tower").
7    path().
8      by(coalesce(values("tower_name", "sensor_name"))).
9      by(values("timestep"))
The results of Example 7-6 follow; we omitted the labels payload from the path() object:
{...,"objects":["104115939","0","10330844","1","126951211","2","127620712","3","103318129","4","103318117","5","Bellevue"],...,"objects":["104115939","0","10330844","1","126951211","2","127620712","3","103318129","0","103318117",//incorrectresult:timeisoutoforder:3,0,5"5","Bellevue"]},...
{...,"objects":["104115939","0","10330844","1","126951211","2","127620712","3","103318129","4","103318117","5","Bellevue"],...,"objects":["104115939","0","10330844","1","126951211","2","127620712","3","103318129","0","103318117",//incorrectresult:timeisoutoforder:3,0,5"5","Bellevue"]},...
The results for Example 7-6 exactly match the results from Example 7-2. This is because the use of has("timestep", loops()) is overloaded, and every traverser passes for all edges.
The mistake we are making here is that we are asking the question “Is loops() accessible,” instead of “Does the value of loops() match the value of the timestep property on the edge?”
Let’s dig in and see why.
The use of the has() step in Example 7-6 creates a filter with the structure has(key, traversal). With this structure, the has() step creates a traversal that starts from the property value timestep. The edge will pass through the has() filter if the traverser survives. The condition that determines whether a traverser survives is loops(), which will always work because loops() will return a value.
Essentially, we created the logic of has(True) in Example 7-6.
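A hedged side-by-side of the three filters may make the distinction clearer; the comments are ours:

has("timestep", 3)         // filter: is the edge's timestep equal to 3?
has("timestep", loops())   // filter: does the loops() traversal produce a
                           // value? it always does, so this is has(True)
where(eq("send_edge")).
  by(loops()).
  by("timestep")           // filter: is loops() equal to the edge's timestep?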
The overloaded use of has(key, traversal) is one of the most common mistakes we find when helping Gremlin users write recursive queries. We hope this helps you avoid making that same mistake.
If has("timestep", loops()) doesn’t work, why does the where().by() pattern work?
Let’s dig into why.
In Example 7-5, we used the following Gremlin pattern to create an edge filter:
where(eq("sendEdge")).by(loops()).by("timestep")
where(eq("sendEdge")).by(loops()).by("timestep")
The basic form of where() in Gremlin is where(a, pred(b)). Our usage applies the shorthand of where(pred(b)), in which the incoming traverser is implicitly assigned to a.
Since the incoming traverser was labeled send_edge, you actually have:
where("sendEdge",eq("sendEdge"))
where("sendEdge",eq("sendEdge"))
This pattern will only ever evaluate to false if you use two different by() modulators, which are then applied to send_edge and eq("send_edge"), respectively; in this case, that happens when the by() modulators emit two different values from the same edge.
Our two by() modulators are emitting the values for loops() and timestep, respectively. If those values are different, the expression evaluates to false and the incoming traverser is eliminated.
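As a hedged trace of the evaluation for a single edge (the values are illustrative):

// suppose the traverser is at loop depth 2 and the edge's timestep is 2:
//   by(loops())    -> 2
//   by("timestep") -> 2
//   eq             -> true: the traverser survives
// at loop depth 3, the same edge yields 3 vs. 2 -> false: eliminated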
At this point, we have completed exploring all concepts required for walking from sensors up to towers. Last up for this example: we go back to the top of our trees and walk from the towers down to the sensors.
The final technical section of this chapter uses the sensor network data to avoid the branching factor issues as we walk from towers to sensors. The queries here apply the sorted order of send edges to navigate specific edges and solve the error we concluded with in Chapter 6.
Let’s start with the tower we explored in the last chapter to answer our first query.
For this question, we want to inspect the Georgetown tower and see how many messages it received and at what time it received each message. As always, we want to construct a JSON object that shows which sensor sent each message and at what time. Let's look at the query in Example 7-7 and then at some results.
1  tower = dev.V().has("Tower", "tower_name", "Georgetown").next()
2  g.V(tower).
3    inE("send").
4    project("Label", "Name", "Time").       // create a map for each edge
5      by(outV().label()).                   // value for the first key "Label"
6      by(outV().                            // value for the second key "Name"
7         coalesce(values("tower_name"),     // if a tower, return tower_name
8                  values("sensor_name"))).  // else, return sensor_name
9      by(values("timestep"))                // value for the third key "Time"
And here are the results of Example 7-7:
{"Label":"Sensor","Name":"1302832","Time":"3"},{"Label":"Sensor","Name":"1002688","Time":"2"},...,{"Label":"Sensor","Name":"1306931","Time":"1"}
{"Label":"Sensor","Name":"1302832","Time":"3"},{"Label":"Sensor","Name":"1002688","Time":"2"},...,{"Label":"Sensor","Name":"1306931","Time":"1"}
This example follows the same construction pattern with project() that we have been using for most of our queries. Let’s walk through what this query is doing, one line at a time.
On line 2 of Example 7-7, we populate our traversal with one vertex: the Georgetown tower. Line 3 splits the one traverser into many traversers; one traverser for each of the seven adjacent edges. This means that the Georgetown tower has a branching factor of 7, and we now have seven traversers to process in our pipeline. Lines 4 through 9 tell each traverser how to report back the necessary data into the result payload. We create a map with the keys Label, Name, and Time on line 4. Line 5 fills the key Label with the label of the outgoing vertex. Lines 6 through 8 fill the Name key with the partition key from the outgoing vertex. Last, line 9 fills the Time key with the edge’s timestep.
We have used this pattern multiple times to construct JSON payloads of our graph data. Hopefully this is becoming ingrained as a useful Gremlin step for shaping query results.
Believe it or not, we have only one more question to ask for this chapter. We want to walk from the Georgetown tower to find valid paths down to sensors.
For this query, we have to define where we want to start in time. Our examination of the results of Example 7-7 shows that we can find trees that end at timestep 3, 2, or 1. Let’s look at trees that ended at timestep 3.
For this query, we are first going to sketch out our approach in pseudocode, as shown in Example 7-8.
Question: What valid paths can we find from Georgetown down to all sensors?

Process:
  Initialize a counter variable
  For a total of counter + 1 times (to account for the zero-th edge),
  do the following:
    Walk to incoming send edges
    Create a filter to compare an edge's timestep with the counter
    Decrease the counter by 1
  Show and shape the path from the tower to the ending sensor
To write this type of query, we need to dive into a new Gremlin concept: the sack() operator.
As we walk from a tower down through different levels of the tree, we want a data structure that tracks how many steps we have taken. In Example 7-5, we used the loops() step, which increments by one, but here we need a value that decreases by one at each depth.
We need something different.
We can customize a variable in a Gremlin traversal with the sack() step. You can think of the sack step as giving each traverser a backpack at the beginning of its journey in your graph data. You can initialize the sack with whatever you would like. As your traverser moves through graph data, it can mutate the contents of its sack according to what it is processing from the graph data.
sack()
A traverser can contain a local data structure called a sack. The sack()-step is used to read and write to a traverser's sack.
withSack()
The withSack() step is used to initialize the sack data structure.
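Before the full query, here is a hedged toy example of the sack mechanics; it is our own, not from the notebooks:

// give every traverser a sack initialized to 3, subtract 1, read it back
g.withSack(3).
  V(tower).
  sack(minus).by(constant(1)).
  sack()                          // emits 2 for each traverser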
In the next query, we will start at timestep 3 and walk through edges with timestep values of 2, 1, and 0, respectively. You can change this to any start time for additional practice. We picked start = 3 to teach the concepts in this query.
Let’s see how to use repeat() with times() and the sack() operator in Gremlin to answer the pseudocode we outlined in Example 7-8. The query is in Example 7-9.
1  start = 3
2  tower = dev.V().has("Tower", "tower_name", "Georgetown").next()
3  g.withSack(start).                   // every traverser starts with a sack with a value of 3
4    V(tower).as("start").              // look up Georgetown
5    repeat(inE("send").as("send_edge").  // traverse to incoming edges
6           where(eq("send_edge")).     // create an equality filter:
7             by(sack()).               // test if the sack() value
8             by("timestep").           // equals the edge's timestep
9           sack(minus).                // decrease the sack's value
10            by(constant(1)).          // by 1
11          outV().as("visited")).      // traverse to adjacent vertex
12   times(start + 1).                  // do lines 5-10 four times
13   as("tower").                       // this vertex passed all edge filters
14   path().                            // get the path to it starting from Georgetown
15     by(coalesce(values("tower_name",     // first object in path is a vertex
16                        "sensor_name"))).
17     by(values("timestep"))           // second object in path is an edge
Let’s step through the query one line at a time and then take a look at the results.
Line 1 of Example 7-9 initializes a starting variable to 3. We will use this variable multiple times in the query. The first place we use the variable is on line 3, where we initialize a traverser's sack to be 3. Line 4 populates our traversal pipeline with the Georgetown tower. Then we see the repeat()/times() pattern on lines 5 and 12. Here, we use the value start + 1 as the stopping condition for any traverser. This means that the traversal from lines 5 through 12 will be completed after start + 1 = 4 iterations.
Within the repeat clause, we construct a filter for every edge that we process. We use the same where()/by() pattern that we did before. This time, however, we replace loops() with sack(), which means that an edge’s timestep will be compared to the value in sack().
Let’s walk through how sack() works within this loop.
The first time we process the traversal within the repeat step, each traverser will have 3 stored within its sack. This means that the first time we use the filter on lines 6, 7, and 8, we will compare an edge’s timestep to the integer 3. Only the edges adjacent to Georgetown with a timestep of 3 will pass through this filter.
On line 9, we mutate the value in a traverser’s sack. We decrease the sack’s value with the sack(minus) step. The by() modulator on line 10 tells the traverser how much to subtract from the sack. We want to subtract one, so we use by(constant(1)).
On line 11, we move to the other vertex, and line 12 checks the looping condition. Lines 14 through 17 format the path results, as we have done many times. The results of Example 7-9 follow; we omitted the labels payload from the path() object:
{...,"objects":["Georgetown","3","1302832","2","1059089","1","1255230","0","1248210"],...,"objects":["Georgetown","3","1302832","2","1059089","1","1302832",//cycle"0","1010055"]},...
{...,"objects":["Georgetown","3","1302832","2","1059089","1","1255230","0","1248210"],...,"objects":["Georgetown","3","1302832","2","1059089","1","1302832",//cycle"0","1010055"]},...
A keen observer will see an unexpected result. The second object contains a repeated sensor, 1302832, even though the path follows the correct time values. We need to remove cycles from our results, as we did in Chapter 6.
The resulting query, shown in Example 7-10, is the same as before, but with this new step on line 12.
1  start = 3
2  tower = dev.V().has("Tower", "tower_name", "Georgetown").next()
3  g.withSack(start).
4    V(tower).as("start").
5    repeat(inE("send").as("send_edge").
6           where(eq("send_edge")).
7             by(sack()).
8             by("timestep").
9           sack(minus).
10            by(constant(1)).
11          outV().as("visited").
12          simplePath()).              // remove cycles
13   times(start + 1).
14   as("tower").
15   path().
16     by(coalesce(values("tower_name",
17                        "sensor_name"))).
18     by(values("timestep"))
The results of Example 7-10 follow; we omitted the labels payload from the path() object:
{...,"objects":["Georgetown","3","1302832","2","1059089","1","1255230","0","1248210"]},...,"objects":["Georgetown","3","1302832","2","1059089","1","1255230","0","1280634"]}
{...,"objects":["Georgetown","3","1302832","2","1059089","1","1255230","0","1248210"]},...,"objects":["Georgetown","3","1302832","2","1059089","1","1255230","0","1280634"]}
Inspecting the result payloads, we see two different valid trees that start from the Georgetown tower. One tree ends at sensor 1248210. The other ends at sensor 1280634.
And that is it for our query creation!
We have successfully addressed the errors from the end of Chapter 6 and are able to walk to and from the leaves and roots in our example data.
As a data engineer for Edge Energy, your final task is to apply what you have built to address Edge Energy’s larger problem: what is the impact of a shutdown or tower failure on the network?
The art of understanding your data and graph technology derives from integrating multiple components to solve complex problems. Over the past two chapters, we have been setting up data, schema, and queries to do just that: use the relationships within our data to provide insights into a network’s dynamic and evolving topology.
So how do we integrate our results over the past two chapters to resolve Edge Energy’s complex problem? We break down the company’s complex problem using the tools we have set up.
We have been querying around the Georgetown tower for a while now. Let’s revisit the image we saw in Chapter 6 and think about the impact if the Georgetown tower were to fail. The image in Figure 7-14 shows the Georgetown tower in orange. The green arteries are all nearby sensors. The blue diamonds are other nearby towers.
Consider what would happen if the Georgetown tower were to go down. Which sensors, if any, will we lose connection to? Will they be only the sensors that surround the tower?
Let’s query our graph and let the data tell us what would happen. We have ironed out two tools to use to answer this question:
We can report, for any tower, all of the sensors that communicated with it.
For any sensor, we can tell which towers it connected with.
To resolve Edge Energy’s complex network failure problem, we can apply the following procedure for the Georgetown tower:
Get a list of sensors that connected with Georgetown in any time window.
For each sensor, query the network to see if they used a different tower in that time window.
Let’s answer each question using the queries we already built.
Example 7-11 shows what we did in the accompanying Studio Notebook:
Question: Get a list of sensors that connected with Georgetown in any time window

Process:
  Wrap our query from a tower to sensors in a method: getSensorsFromTower()
  For each step in time:
    Find all sensors that connected with Georgetown
    Create a unique list of the sensors

The code for the pseudocode in Example 7-11 is shown in Example 7-12.
// wrap our query of valid paths in a method called getSensorsFromTower
def getSensorsFromTower(g, start, tower) {
    sensors = g.withSack(start).
                V(tower).
                repeat(inE("send").as("sendEdge").
                       where(eq("sendEdge")).
                         by(sack()).
                         by("timestep").
                       sack(minus).
                         by(constant(1)).
                       outV().
                       simplePath()).
                times(start + 1).
                values("sensor_name").
                toList()
    return sensors;
}

atRiskSensors = [] as Set;   // create a set of sensors
tower = g.V().has("Tower", "tower_name", "Georgetown").next();
for (time = 0; time < 6; time++) {   // loop through a window of time
    // all sensors into Georgetown's list at this time via getSensorsFromTower()
    atRiskSensors.addAll(getSensorsFromTower(g, time, tower));
}
The main result from Example 7-12 is the object atRiskSensors. This is a list of all sensors that had valid communication paths with the Georgetown tower. The first four sensors are:
"1302832","1059089","1290383","1201412",...
"1302832","1059089","1290383","1201412",...
There is one last thing we need to know to provide proactive information to Edge Energy. We need to know which of the other towers the at-risk sensors communicated with.
Example 7-13 shows what we did in the accompanying Studio Notebook.
Question: For each at-risk sensor, find all towers it communicated with

Process:
  Wrap our query from a sensor to towers in a method: getTowersFromSensor()
  For each sensor in atRiskSensors:
    For each step in time:
      Find the towers the sensor connected with
      Add to a map of the unique towers a sensor connected to
  Find sensors that connected only to Georgetown
As we analyze all of the paths in our data, we ultimately are looking for sensors that uniquely connected to Georgetown. The code for our pseudocode in Example 7-13 is shown in Example 7-14.
// wrap our query of valid paths in a method called getTowersFromSensor
def getTowersFromSensor(g, start, sensor) {
    towers = g.withSack(start).
               V(sensor).
               until(hasLabel("Tower")).
               repeat(outE("send").as("sendEdge").
                      where(eq("sendEdge")).
                        by(sack()).
                        by("timestep").
                      inV().
                      sack(sum).
                        by(constant(1))).
               values("tower_name").
               dedup().
               toList()
    return towers;
}

otherTowers = [:];   // create a map
for (i = 0; i < atRiskSensors.size(); i++) {      // loop through all sensors
    otherTowers[atRiskSensors[i]] = [] as Set;    // initialize the map for a sensor
    sensor = g.V().has("Sensor", "sensor_name", atRiskSensors[i]).next();
    for (time = 0; time < 6; time++) {            // loop through a window of time
        // use getTowersFromSensor to add all towers
        // into the map for this sensor at this time
        otherTowers[atRiskSensors[i]].addAll(getTowersFromSensor(g, time, sensor));
    }
}
The main result from Example 7-14 is the object otherTowers. This is a hashMap of all unique towers that had valid communication paths from the starting sensor. Let's take a look at the first few entries in otherTowers.
{"1035508":["Georgetown","WhiteCenter","RainierValley"]},{"1201412":["Georgetown","Youngstown"]},{"1255230":["Georgetown"]},...
{"1035508":["Georgetown","WhiteCenter","RainierValley"]},{"1201412":["Georgetown","Youngstown"]},{"1255230":["Georgetown"]},...
Example 7-15 brings everything from the past two chapters together into one payload. We interpret this data to mean that 1035508 has two other options in the event that Georgetown fails: WhiteCenter or RainierValley. However, 1255230 is a sensor at risk because it communicated only with Georgetown during the time window we studied.
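The last pseudocode step, finding the sensors that connected only to Georgetown, can be finished with a short filter over the map. A hedged Groovy sketch; the variable name trulyAtRisk is ours:

// keep only the entries whose tower set is exactly {Georgetown}
trulyAtRisk = otherTowers.findAll { sensorName, towers ->
    towers == (["Georgetown"] as Set)
}.keySet()
// e.g., includes "1255230" from the payload above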
We visualize all at-risk sensors from Example 7-15 in Figure 7-15.
The map in Figure 7-15 visualizes a network failure scenario for the Georgetown tower. The Georgetown tower is shown in red. All sensors that communicated only with Georgetown during a particular time window of interest are shown in orange. All sensors that successfully communicated with other nearby towers are shown in green. The other towers are shown as blue diamonds.
Let’s step back a bit to understand where we are.
What we have built toward is the start of a proactive conversation with the Edge Energy team. We can take these results, the data, and its observable relationships within the network to determine Edge Energy’s next step.
The orange sensors are not failures to report back to Edge Energy. They represent sensors that are at risk. Examining the geo-locations in Figure 7-15 reveals that there are many nearby sensors and towers that each at-risk sensor could connect with. Only through additional observations over time will Edge Energy be able to fully understand any sensor's individual risk in the network.
With distributed graph technology, we are helping Edge Energy monitor its network. It can use the evolving structure of this graph’s topology to be proactive about different network failure scenarios.
Our work over Chapters 6 and 7 explored the hierarchical structure of time series data from a self-organizing network of sensors so that we could solve a complex problem about Edge Energy’s dynamic network. We stitched together our queries and understanding of the data to help Edge Energy with a complex problem: how to use time series data in a graph to be proactive about network failures.
Who knew that traversing through trees could be a walk in a park?
If you haven't already done so, we recommend you take all of this for a drive yourself. The accompanying Studio Notebooks, found at https://oreil.ly/graph-book, walk through each of these queries, with a few more bonus items not mentioned in these chapters.
So far in this book, we have covered the data models and queries for two of the most popular graph models in distributed systems: neighborhoods and hierarchies. The next chapter introduces another popular data pattern. We will be introducing and using the third most popular data model and queries for distributed graph applications: network paths.
Pathfinding in graph data is the next most popular use of graph technology, after neighborhood retrieval and unbounded hierarchies.
In addition to interviewing graph users around the world for this book, we also spent a significant amount of time working with them. More often than not, our working sessions centered on finding unknown paths within graph data.
During one of those working sessions, we were training a team on popular pathfinding techniques. We were using a graph of flight paths between airports to reason about flight patterns between cities.1 We started our exercise with the two most popular questions about air travel: how many direct connections are there from this specific airport? And how many airports are reachable within two flights?
The troubleshooting discussion during the workshop led me to question how people use path information to make an informed decision.
One particularly interesting implication is related to trust.
How do you decide if you trust somebody? You trust your friends. And you probably trust friends of your friends more than you trust a random stranger. Why is that?
It is your trust in different paths between you and something else that motivates and informs your preferences.
There are four main sections of this chapter.
We’ll first cover some more examples of how we all use paths to quantify trust. Then we’ll start with an overview of the required concepts from mathematics and computer science for working with paths in graph data. After that, we’ll set up this chapter’s example, in which we’ll be working with, querying, and finding paths throughout the Bitcoin trust network to answer the fundamental question: how much should you trust someone before you interact with them? The final section of this chapter applies path queries to the Bitcoin trust network. We will start with exploring and understanding trust within the data. Then we’ll show you how to use path queries to inform a decision about whether to trust a particular Bitcoin wallet.
We’ll conclude the chapter with a mathematical quantification of trust that leads to a problem we’ll solve in the next chapter.
The theme of using data to quantify trust extends beyond the air travel example previously mentioned. The correlation between trust and paths in graph data applies to almost all of the path applications we work on with our customers around the world.
We have seen this in how people use social media, in how detectives build criminal cases, and in logistics optimizations.
Think about the social media platform that you use most regularly.
How do you determine whether you are going to accept that connection, follower, or friend request?
If you are like most people, you undertake a very common process for new connection requests. Typically, you first look at the shared connections between you and the potential new friend, connection, or follower. Figure 8-1 offers a graph of the possible connections you might look for.
You likely ask yourself, “How many friends do I have in common with the person?” Is it 3 shared friends or 30? Then you look at the quality of those shared connections. Are any of your closest friends or family members in the list of shared connections? Are your shared connections all from a specific point in your life, like a particular job or school?
Your analysis consists of walking through the quantity and quality of your shared connections. You are using the paths between you and the new connection to contextualize and inform how you know that person. Ultimately, it is your trust in those paths that leads you to accept or reject that new connection.
Accepting a new connection on social media starts with your shortest path and then naturally evolves into the quality and context of those paths.
Social media helps us quantify how much we trust anyone new. We use our shared connections to construct a story about how we know someone and therefore whether we trust them.
It may be something that we now do naturally every time we engage with our networks. But this isn’t the first use of this technique. Investigators have been using trusted sources to create connections between two previously disconnected individuals for a long time.
The long history of criminal investigations, together with rising volumes of data and emerging patterns of graph technology, serves as the perfect environment for quantifying trust in relationships across data.
A detective’s work is to pull together sources of information to understand how two individuals are connected. Detectives obtain access to records by subpoenaing data sources related to the case. Then investigators unify the data sources and directly search for unknown connections within their open case. Figure 8-2 shows a graph of some of these data sources. The figure depicts what we came up with for a detective’s story, but you should think about this conceptually, too.
Drawing correlations about the connections between two individuals in a criminal investigation uses paths through data to tell stories about what happened. The investigators are reporting information governed by law; they have to trust the quality of the connections that construct the story.
On a less serious scale, you do the same type of investigations when you make a decision regarding your personal flight schedule. You make decisions about your air travel based on the context and quality of the route you purchase in the same way investigators derive conclusions about a case.
Let’s look at a third example of using paths in networks to quantify trust.
A logistics company might seek to minimize costs and time along its delivery routes. As part of that minimization, it may consider the number of times a package has to be transferred between the warehouse and your front door. Fewer transfers mean fewer chances for a package to be lost or misplaced. We drew a graph that represents this network in Figure 8-3.
Figure 8-3 depicts how a package travels from a warehouse to your home. You see three potential paths, each with different combinations of length and types of transfer. Depending on a multitude of factors, one path in our logistics network may be more trusted than another.
For example, if you are someone who watches your package’s path, you also feel the effect of route optimization for shipping. The more stops you see your package take, the lower your trust in its on-time arrival.
Route optimization is one of the most popular uses of graphs in computer science. Whether you are making decisions for personal travel or waiting for a package, the most trusted solutions seek the shortest path through the data.
It is the trust in the path between the source and the destination that matters.
Quantifying trust between two concepts through understanding shared relationships is (probably) the most relatable and approachable application of distributed graph technology today.
Pathfinding queries are popular uses of graph technology when you do not know exactly how to walk between vertices in your graph.
However, discovering paths throughout graph structure may become a double-edged sword: on the one hand, pathfinding with graph technology will provide you with short and elegant solutions; on the other hand, naive pathfinding queries may quickly get out of control.
Pathfinding questions are simple to ask but expensive to compute. This is where things can get out of hand very quickly.
Let’s start by walking through the fundamental problem definitions for discovering paths in graph structure.
In this chapter, we will introduce shortest paths according to a path’s distance.
Recall from Chapter 2 the definition of distance as the smallest number of edges it takes to walk from one vertex to another. The shortest path problem is to find the path with the smallest distance, or shortest walk, from one vertex to another in your graph. Here are the four terms and their definitions that we will be applying throughout the next two chapters.
A path in a graph is a sequence of consecutive edges.
The shortest path between two vertices is the path that connects the two vertices and has the shortest length or distance.
The distance between two vertices in a graph is the number of edges in a shortest path.
In Figure 8-4, there are three ways to walk from A to D:
A → D
A → C → D
A → B → C → D
The shortest path is the path with the smallest distance. That is the path from A to D, which has a distance of 1. The other paths have distances of 2 and 3, respectively.
There are three types of shortest path problems:
The goal of a shortest path problem is to discover the smallest-distance walk from A to B.
The goal of a single-source shortest path problem is to discover the smallest-distance walk from A to all other vertices in the graph.
The goal of an all-pairs shortest path problem is to discover the smallest-distance walk between any two vertices in the graph.
These definitions give us a classification of the three types of pathfinding problems that you may have run into or will run into. This chapter focuses on solutions to the first type of problem: finding the shortest path between two known points.
Any solution to a path problem relies on understanding how to procedurally walk through graph data. Let’s dig into depth-first search (DFS) and breadth-first search (BFS), two fundamental techniques for finding shortest paths.
Depth-first search is an algorithm for traversing graph data structures. It explores a path as deep as possible along each branch before backtracking.
Breadth-first search is an algorithm for traversing graph data structures. It explores all of the neighbor vertices at the present depth prior to moving on to the vertices at the next depth level.
You may be wondering: “Why do we need to go into DFS versus BFS?”
First, most engineers start their research about pathfinding by searching for information on a certain pathfinding algorithm. To us, that is a backwards approach. Second, because paths are so natural to understand, it is easy to confuse the solution with the underlying problem.
You first need to understand which path problem you are trying to solve before you apply a certain algorithm.
Depth-first search and breadth-first search are two of the most popular ways to illustrate procedural visitation of graph-structured data. Diving deep into understanding each technique gives you the foundation you need to explore the world of pathfinding algorithms, because at some level, all other solutions to pathfinding problems build upon these two techniques.
The difference between the two approaches is easy to understand. Depth-first search prioritizes exploring one path as deeply as you can before returning to a different path. Breadth-first search prioritizes exploring all paths up to a certain distance before moving deeper into the data.
Let’s take a look at these differences in Figure 8-5. As you walk through the figure, the main idea is to consider the order in which a vertex is visited for each process; we call this the visited set. We numerically label each vertex in Figure 8-5 with the order in which it is visited (or reached) by each algorithm.
For each graph in Figure 8-5, the goal is to walk procedurally from the starting vertex at the top to the end. The graph on the left shows the order in which every vertex is visited according to DFS. Here, you see that each branch is explored until its end before you return back to the top to select a different path. The graph on the right shows the order in which every vertex is visited according to BFS. Here, you see that each level or neighborhood is fully explored before you move deeper into the graph.
The implementation details between DFS and BFS come down to which data structure you use. DFS uses a last in, first out (LIFO) stack. You can remember this by visualizing a stack. Stacks are typically thought of as vertical structures, just like how DFS explores data deeply before going wide. BFS uses a first in, first out (FIFO) queue. You can remember this by visualizing a queue. Queues are typically thought of as horizontal structures, just like how BFS explores widely before going deep.
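To make the stack-versus-queue distinction concrete, here is a minimal sketch in Groovy (the host language of our Gremlin examples). The walk function, the adjacency map adj, and the vertex names are hypothetical teaching props, not code from this book's notebooks; the only thing that changes between the two behaviors is which end of the deque we take from.

// A minimal sketch: DFS and BFS differ only in the data structure that
// holds the frontier, a LIFO stack versus a FIFO queue.
def walk(Map<String, List<String>> adj, String start, boolean depthFirst) {
    def frontier = new ArrayDeque<String>([start])
    def visited = new LinkedHashSet<String>()          // the "visited set," in visit order
    while (!frontier.isEmpty()) {
        // DFS pops the most recently added vertex (stack);
        // BFS takes the oldest one (queue)
        def v = depthFirst ? frontier.pollLast() : frontier.pollFirst()
        if (!visited.add(v)) continue                  // skip vertices we already reached
        adj.getOrDefault(v, []).each { frontier.addLast(it) }
    }
    return visited
}

def adj = [A: ["B", "C"], B: ["D"], C: ["D"], D: []]   // hypothetical toy graph
assert walk(adj, "A", true).toList()  == ["A", "C", "D", "B"]  // deep first, then backtrack
assert walk(adj, "A", false).toList() == ["A", "B", "C", "D"]  // level by level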
However long it takes you to think like a graph, it is vital that you understand the runtime and overhead required to process your data. So keep practicing how to procedurally think through how much data you need to visit during a traversal. From there, you can quantify an expectation for how long a traversal or algorithm will run or the overhead it requires for your data.
Think about the last time you used LinkedIn. You likely opened the LinkedIn application to search for someone else. When you found candidates, you received a metric indicating how closely connected you were to each person in your search results. You knew right away whether someone was a first-, second-, or third-degree connection.
Now think like an engineer working for LinkedIn.
In this scenario, you are designing the connected badge feature that you just used. It is a requirement that any user of LinkedIn knows the distance from themselves to anyone else when they search. From there, you and your engineering team have a long list of approaches to consider.
Do you precalculate all distance values for the connected badge by solving the all-pairs shortest path problem for your graph? If you do, what happens when new connections are added to or removed from LinkedIn’s network?
What are the end user’s expectations for knowing their connectedness to another person? What requirements can you relax in order to prioritize speed of delivering the information to the end user?
Though presented in the context of pathfinding at LinkedIn, all of these questions are common considerations for any team wanting to use path distance in an application.
To answer any of those questions about your application’s design, you need to understand the performance implications of processing graph-structured data. And the fundamental approach to walking through graph-structured data to solve problems at LinkedIn scale builds procedures off of BFS or DFS.
We will be using these fundamental techniques in the coming sections as we explore the example data and find paths of trust throughout it. To that end, let’s introduce the data for this chapter’s example and apply shortest paths to our sample problem.
Distance between concepts quantifies trust.
To bring that axiom to life, our running example from now until the end of Chapter 9 dives into the world of Bitcoin. Exploring a network of trust between Bitcoin traders creates an interesting intersection between paths in graph data and trust. Ironically, the advent of Bitcoin centers on a distrust of centralized institutions.
In this section, we will introduce the data, walk through a brief primer on Bitcoin terminology, and develop our data model.
We will be exploring a network of people who trade Bitcoin on the Bitcoin OTC (Over The Counter) Marketplace. The Bitcoin OTC Marketplace allows its members to rate how much they trust other members, and those ratings form who-trusts-whom networks, which we will be using in the dataset. These ratings are given on a scale of [–10, 10]. You will see the ratings in the details to come, but we won’t use them in our queries until Chapter 9. The data comes from the research work of Srijan Kumar et al. and can be found on the Stanford Network Analysis Platform.2 3
Stanford Network Analysis Platform (SNAP) is a general-purpose network analysis and graph mining library.
Each line in the dataset has one rating, sorted by time, with the following format:
SOURCE, TARGET, RATING, TIME
The meaning for each piece of the data is as follows:
SOURCE
The ID of the member giving the rating
TARGET
The ID of the member receiving the rating
RATING
The source’s rating for the target, ranging from –10 to +10 in steps of 1
TIME
The time of the rating, in seconds since the epoch
Let’s look at the first five lines of the data in Example 8-1.
$ head -5 soc-sign-Bitcoinotc.csv
6,2,4,1289241911.72836
6,5,2,1289241941.53378
1,15,1,1289243140.39049
4,3,7,1289245277.36975
13,16,8,1289254254.44746
Let’s examine the first line of data from Example 8-1: 6,2,4,1289241911.72836. This means that the person with ID 6 gave the person with ID 2 a trust rating of 4. This rating was captured at epoch time 1289241911.72836, or Monday, November 8, 2010, at 13:45 GMT.
The original source data has time in epoch. The data that accompanies this book uses the ISO 8601 standard because we converted the timestamps for ease of understanding in our examples. For example, 1289241911.72836 in epoch time converts to 2010-11-08T13:45:11.728360Z in the ISO 8601 standard.
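For the curious, the conversion itself is a few lines with java.time. The following Groovy sketch shows one way it could be done; it is an illustration of the idea, not the book's actual ETL code.

import java.time.Instant

// Split an epoch timestamp such as 1289241911.72836 into whole seconds
// and nanoseconds, then render it in the ISO 8601 standard.
def epoch   = new BigDecimal("1289241911.72836")
def seconds = epoch.toBigInteger().longValueExact()
def nanos   = epoch.remainder(BigDecimal.ONE).movePointRight(9).longValue()
assert Instant.ofEpochSecond(seconds, nanos).toString() ==
       "2010-11-08T13:45:11.728360Z"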
Before we can build out interesting queries and a data model, let’s take a tour of the world of Bitcoin terminology.
Bitcoin is a cryptocurrency, a decentralized digital currency, meaning there is no central bank or institution that controls its value. Instead, Bitcoin is exchanged on a peer-to-peer network.
Each bitcoin is basically a computer file stored in a digital wallet application on a smartphone or computer. People can send whole bitcoins or fractions thereof to your digital wallet, and you can send bitcoins to other people. Every transaction is recorded in a public list called the blockchain.
An address is a Bitcoin public key to which transactions can be sent.
A wallet is a collection of private keys that correspond to addresses.
In our data, we are working with what we can observe on the blockchain. We can observe the exchange of bitcoins between two people. We say you send bitcoins to or receive bitcoins from an address. You encrypt, export, back up, and import your wallet. A wallet can have multiple private keys that correspond to addresses.
From here, we are able to define a schema to use in development for our example.
Though the sample data shows integers, real Bitcoin addresses actually are alphanumeric strings with up to 34 characters. Therefore, we will be using the Text data type in our graph schema for the addresses.
The data model we will need is quite simple. We have a list of addresses that rated other addresses. An address can rate another address many times; we would like to capture each rating by its unique rating value.
We talk about the data with the phrase “this address rated that address.” Applying our data modeling tips gives us one vertex label, Address, and one edge label, rated. Figure 8-6 illustrates the conceptual model for our example.
Using the GSL (graph schema language from Chapter 2), we translate the conceptual model from Figure 8-6 into the schema code in Example 8-2.
schema.vertexLabel("Address").
       ifNotExists().
       partitionBy("public_key", Text).
       create();

schema.edgeLabel("rated").
       ifNotExists().
       from("Address").to("Address").
       clusterBy("trust", Int, Desc).
       property("datetime", Text).
       create()
Following our setup from Chapter 4, we are again using Text as the type for time to make it easier to teach concepts in our upcoming examples. We are using Text for time because we will be using the ISO 8601 standard format stored as text: YYYY-MM-DD’T’hh:mm:ss’Z’, where 2016-01-01T00:00:00.000000Z represents the very beginning of January 2016.
Once we have created our graph schema, we are ready to load data.
We did some basic ETL (extract-transform-load) on soc-sign-Bitcoinotc.csv to create two separate files: Address.csv and rated.csv. This work was required to translate the datetime data from epoch into ISO 8601 standard so that the data was ready to be loaded into DataStax Graph.
To get an idea of our data, let’s take a look at the top five lines of rated.csv in Table 8-1. As before, we set up our csv file to have a header. The header line needs to match the names of the properties from your DataStax Graph schema definition in Example 8-2. You can also define a mapping between your csv file and database schema when using the loading tool.4
| out_public_key | in_public_key | datetime | trust |
|---|---|---|---|
| 1128 | 13 | 2016-01-24T20:12:03.757280 | 2 |
| 13 | 1128 | 2016-01-24T18:53:52.985710 | 1 |
| 2731 | 4897 | 2016-01-24T18:50:34.034020 | 5 |
| 2731 | 3901 | 2016-01-24T18:50:28.049490 | 5 |
From Table 8-1, we can get an idea of the type of data in our example. We will have edges between two public keys, and those edges will have two properties: datetime and trust. The edge represents a trust rating from one key to another that was created at a certain time and given a rating. For example, let’s examine one line of data:
1128 | 13 | 2016-01-24T20:12:03.757280 | 2
This line means that the wallet with the key 1128 gave wallet 13 a trust rating of 2 on January 24, 2016, at 20:12:04 (rounded).
The accompanying scripts use the same loading process that we have stepped through a few times now. If you would like to see the code, please head to the Chapter 8 data directory within the book’s GitHub repository for the data and loading scripts for these examples.
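If you just want to smoke-test the schema before running the full loader, a single rating can also be inserted by hand. The following Gremlin sketch is our own illustration against the schema in Example 8-2, using one made-up row; it is not part of the book's loading scripts.

// Hypothetical single rating: wallet 1128 rates wallet 13 with a trust of 2
g.addV("Address").property("public_key", "1128").as("rater").
  addV("Address").property("public_key", "13").as("ratee").
  addE("rated").from("rater").to("ratee").
  property("trust", 2).
  property("datetime", "2016-01-24T20:12:03.757280Z")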
Let’s do some basic exploratory queries to ensure that we understand our data and that it loaded correctly.
The exploration exercise in DataStax Studio observes communities of trust within the data.
We start by confirming that the correct number of vertices and edges have been loaded into our graph. Example 8-3 starts by counting the total number of vertices loaded into DataStax Graph to compare it to the SNAP dataset.
dev.V().hasLabel("Address").count()
Example 8-3 returns “5881,” which matches the total number of unique public keys loaded from the SNAP dataset: 5,881. Next, Example 8-4 counts the total number of edges loaded into DataStax Graph to compare it to the SNAP dataset.
dev.E().hasLabel("rated").count()
Example 8-4 returns “35592,” confirming the total number of unique ratings from the SNAP dataset: 35,592.
Let’s look at a subgraph of trust communities in Figure 8-7, which shows the second neighborhood from a starting address.
DataStax Studio uses modularity maximization via the Louvain Community Detection Algorithm to assign colors to the subgraph within the Studio client application.
Figure 8-7 displays the second neighborhood from a single starting vertex. We turned on DataStax Studio’s graph visualization option to display a graph view of the results and configured the visualization to show community detection within this subgraph.
As we show in Figure 8-7, exploring graph data can be very fun. By creating a simple schema and using bulk loading tools, we hope you were able to follow along from schema creation to data loading and graph visualizations in a matter of minutes.
From here, we want to move away from data exploration and into defining our queries. Our objective is to quantify trust between two wallets by finding the shortest path from one public key to another in this dataset.
Our main objective is to find a good pair of addresses that we’ll use in our pathfinding examples in the next section. For the first address in our pair, we cheated a bit. We just randomly selected a starting address: public_key: 1094. The interesting work in this section queries the neighborhoods around 1094 to find a good candidate for pathfinding queries. For our purposes, we will be looking for an address that has not previously transacted with 1094 but has many shared connections.
We are constructing a pair of vertices so that we can validate our longer queries later. We admit that this makes our example feel concocted, but we are weaving in practices of test-driven development to illustrate how to test a new Gremlin query for valid and expected results.
Let’s start by identifying the addresses that 1094 has previously rated.
The addresses in the first neighborhood of 1094 are the same as the addresses that 1094 has previously rated. Example 8-5 reviews how to explore the first neighborhood in Gremlin:
dev.V().has("Address","public_key","1094").out("rated").values("public_key")
dev.V().has("Address","public_key","1094").out("rated").values("public_key")
There are 31 unique addresses in the results of Example 8-5. The first 5 of them are:
"1053","1173","1237","1242","1268",...
"1053","1173","1237","1242","1268",...
The 31 addresses in the first neighborhood would not be good candidates for our example because they have a distance of 1 from 1094, like what you see in Figure 8-8.
Let’s move into the second neighborhood.
From the first neighborhood, we need to walk out one more edge to reach the second neighborhood. Example 8-6 shows how to walk to the second neighborhood in Gremlin.
dev.V().has("Address","public_key","1094").
  out("rated").
  out("rated").
  dedup().              // remove duplicates to get the list of unique neighbors
  values("public_key")
There are 613 unique addresses. The first 5 are:
"1053","1173","1162","1334","1241",...
"1053","1173","1162","1334","1241",...
You may be wondering why we needed the dedup() step in Example 8-6. We have to use dedup() because we want the unique set of addresses in the second neighborhood. Without dedup(), the query returns 876 results; those additional 263 results represent multiple ways to walk out two edges from 1094.
To see this, consider Figure 8-9.
The address with public_key 1334 has two different ways to reach public_key 1094: via the 1053 or the 1173 vertex. Therefore, public_key 1334 will be listed at least twice in the second neighborhood of 1094. Using dedup() removes duplicate objects in the traversal stream. For Example 8-6, it takes all observations of 1334 and reduces them to just 1 in the result set.
Using dedup() shows how we arrived at a result set size of 613 instead of 876.
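If you would like to see where those 263 extra walks land, one quick check (our own sketch, not one of the book's numbered examples) is to tally the walks per destination before deduplicating. Keys reached more than once, such as 1334, account for the difference between 876 and 613.

dev.V().has("Address","public_key","1094").
  out("rated").
  out("rated").
  groupCount().          // barrier step: tally traversers per destination vertex
  by("public_key")       // the counts should sum to 876 across 613 distinct keys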
What we really want for our example, however, is an address that is in the second neighborhood but not the first. Let’s take a look at how we can find that set of objects in Gremlin.
Before we dive into the query, let’s think about what we are asking for here. To see that, let’s go back to our sample graph data that shows part of the second neighborhood of 1094, as seen in Figure 8-10.
Figure 8-10 shows how vertices 1053 and 1173 are members of both the first and the second neighborhood of 1094. Our original question is looking for a good example that is not directly connected to 1094. We need to eliminate vertices such as 1053 and 1173 from our result set.
In Gremlin, we can use the aggregate("x") step to fill a collection, named x in this example. Then we can eliminate the unwanted vertices from the result set with the where(without("x")) pattern. Let’s see this in action in Example 8-7.
1  dev.V().has("Address","public_key","1094").aggregate("x").
2    out("rated").aggregate("x").
3    out("rated").
4    dedup().
5    where(without("x")).
6    values("public_key")
The result set for Example 8-7 has 590 unique elements. The first 5 are:
"628","1905","1013","1337","3062"...
"628","1905","1013","1337","3062"...
Let’s walk through Example 8-7. On line 1, we query for 1094 and initialize an object, x, with that vertex. On line 2, we traverse to the first neighborhood and add all of those vertices into x. Then we walk out to the second neighborhood on line 3.
Let’s talk in detail about what is happening between lines 4 and 5 in Example 8-7. The use of dedup() on line 4 forces all traversers to complete their work before moving to line 5. In this case, we are waiting for all traversers to reach the second neighborhood away from 1094 before we continue. Then on line 5, we apply a filter with the where(without("x")) pattern. Line 5 is essentially asking every traverser in the pipeline, “Are you in the set x?” If a vertex is in x, it is removed from the pipeline. If not, the traverser is allowed to continue.
If you have a strong background in relational systems, Example 8-7 is very similar to performing a right outer self join on the address table.
With Example 8-7 in mind, let’s take a side tour into the differences between lazy and eager evaluation in the Gremlin query language. We need to dig into the evaluation strategies of the Gremlin query language because they change the behavior of your traversal, which in turn can produce unexpected query results.
We are about to go really deep into functional programming. If you don’t fully understand the next section, it is OK. You just need to get the big-picture point: that barrier steps affect BFS-like and DFS-like behavior in pathfinding with Gremlin.
Gremlin is primarily a lazy stream-processing language. This means that Gremlin tries to process any traversers all the way through the traversal pipeline before getting more data from the start of the traversal. This is different from an eager evaluation strategy, which does the work right away before moving on to the next step.
Lazy evaluation delays the evaluation of an expression until its value is needed.
Eager evaluation evaluates an expression as soon as it is bound to a variable.
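Java streams, which Groovy can use directly, follow the same lazy discipline as a Gremlin pipeline without barriers. The following small sketch is a general illustration of laziness, not Gremlin internals:

def seen = []
def pipeline = [1, 2, 3, 4].stream().
    peek { seen << it }.     // record each element that actually flows through
    map { it * 10 }
assert seen == []            // lazy: nothing has been processed yet
pipeline.findFirst()         // a short-circuiting terminal operation
assert seen == [1]           // only as much input was pulled as the result required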
There are numerous situations in which the Gremlin language cannot use lazy evaluation.
We are talking about this concept now because we have been using eager evaluation in our recent traversals with the dedup() and aggregate() steps.
In your daily life, you may choose to perform a task with either evaluation strategy. You probably use eager evaluation when you are cooking because you prepare each ingredient of your meal and then assemble individual plates. Contrast this with creating plate-sized portions that you cook individually from beginning to end.
In Gremlin, the key to knowing when a traversal changes between lazy and eager evaluation is to recognize the barrier steps. When a barrier step exists, a Gremlin traversal changes from lazy to eager evaluation.
The definition of a barrier step in the Gremlin query language is:
A barrier step is a function that turns the lazy traversal pipeline into a bulk-synchronous pipeline.
We want to make this distinction about barrier steps because the Gremlin query language mixes the use of lazy evaluation strategies and eager evaluation strategies when there are barrier steps.
Barrier steps change the behavior of a query to operate like breadth-first search or depth-first search.
Examples of barrier steps used in this book are dedup, aggregate, count, order, group, groupCount, cap, iterate, and fold.
One way to think about these concepts together is that barrier steps force a pipeline to execute like breadth-first search. That is, barrier steps force every traverser to wait until all other traversers in the pipeline have completed the same set of work. After all traversers complete the work up to a barrier step, they can continue.
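TinkerPop also exposes this synchronization directly through an explicit barrier() step. As a sketch on our dataset, inserting barrier() after the first hop forces every traverser to finish that hop before any moves on, giving BFS-like behavior at that point in the pipeline:

dev.V().has("Address","public_key","1094").
  out("rated").
  barrier().             // explicit synchronization point for all traversers
  out("rated").
  count()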
The queries we demonstrate in this book aim to teach the common patterns found in real-time applications. As such, we are mixing BFS and DFS behavior as we write our queries.
We will apply the connection between barrier steps and BFS in a later example to guarantee shortest paths in our queries.
Previously, we successfully found the vertices that are in the second neighborhood, but not the first. Now let’s use the sample() step, as demonstrated in Example 8-8, to randomly select one of them to use for the rest of our queries.
dev.V().has("Address","public_key","1094").aggregate("first_neighborhood").out("rated").aggregate("first_neighborhood").out("rated").dedup().where(neq("first_neighborhood")).values("public_key").sample(1)
dev.V().has("Address","public_key","1094").aggregate("first_neighborhood").out("rated").aggregate("first_neighborhood").out("rated").dedup().where(neq("first_neighborhood")).values("public_key").sample(1)
The result is:
"1337""1337"
Believe it or not, 1337 was the first public_key we randomly sampled for the rest of our pathfinding examples. We are going to take that as a good omen and go with it.
Now we have two addresses: 1094 and 1337. Let’s use them to show how to find paths between them with Gremlin.
As we mentioned earlier, we selected this dataset and example to illustrate the power of using paths to solve complex problems. Distance between concepts or people provides context and meaning for assessing how related they are and whether you can trust them.
For the rest of these exercises, we want you to imagine you have joined a Bitcoin marketplace, specifically the Bitcoin OTC. When you joined, you received the public key 1094. Think about your first transaction on that marketplace with a member who has the public key of 1337.
How much trust do you have in the other address?
Quantifying your trust in another entity with path analysis is the complex question we are going to answer in the next series of exercises.
The upcoming example has five main sections.
The first section begins with finding paths of fixed lengths between our example addresses. We will build upon the queries from the first section to find paths of any length in the second section. The second section illustrates a common progression through applying pathfinding techniques, but it purposefully leads to an error.
The third section explains how we can resolve our error by revisiting lazy and eager evaluation with Gremlin. The fourth section explores understanding path weight for the shortest paths in our example data.
We will conclude with a discussion of how to interpret path length and context for quantifying trust to our question. This sets up how we will transform this dataset to find weighted shortest paths in Chapter 9.
We are starting with finding paths of a fixed length by exploring neighborhoods so that we can validate the results that show up in our shortest path queries at the end of this section.
To get into the mindset of these walks, consider what you would want to know before you accepted someone’s invitation to exchange bitcoins.
If you were about to transact with a new address on a Bitcoin OTC marketplace, you would likely want to know whether you can trust the other person. The place to start in quantifying your trust in 1337 is to find out whether you have shared connections. Finding shared connections in this dataset is the same as looking for addresses you rated that also rated 1337, or vice versa. This type of shared connection doesn’t care about the direction of the rating; we just want to see what shared addresses we have according to who rated whom.
One way to do this is to count the number of ways you can reach 1337 in your second neighborhood. Let’s do this query in Example 8-9.
dev.V().has("Address","public_key","1094").as("start").both("rated").both("rated").has("Address","public_key","1337").count()
dev.V().has("Address","public_key","1094").as("start").both("rated").both("rated").has("Address","public_key","1337").count()
The result of Example 8-9 is 4. This means that within your second neighborhood, there are four ways to walk from your address, 1094, to 1337. Let’s look at the path information to understand those walks.
Recalling our discussion from Chapter 6, the path() step will give you access to each traverser’s full history. Then you want to look at the results according to two features: (1) the vertices visited along the way and (2) the path’s length. Let’s do this in Example 8-10 and then walk through the process and results.
1  dev.V().has("Address","public_key","1094").
2    both("rated").
3    both("rated").
4    has("Address","public_key","1337").
5    path().                                     // traverser's full path history
6    by("public_key").as("traverser_path").      // get each vertex's public key
7    count(local).as("total_vertices").          // count the elements in the path
8    select("traverser_path","total_vertices")   // select the path information
Let’s step through the query in Example 8-10 before we look at the results in Example 8-11.
Lines 1 through 3 in Example 8-10 walk to the second neighborhood from 1094. Line 4 considers only those walks that ended at 1337. Then we want to get the path information from each of the four traversers via the path() step on line 5. Line 6 mutates the path objects to show only their public_key and stores a reference to it. Then on line 7, we count the total number of objects within each path with count(local). Here, the local scope asks to count the total number of objects within the path instead of using the default global scope of count(), which would count the total number of paths. On line 8, we select each path object alongside the total number of vertices within each path.
The results are shown in Example 8-11.
{"traverser_path":{"labels":[[],[],[]],"objects":["1094","1268","1337"]},"total_vertices":"3"},{"traverser_path":{"labels":[[],[],[]],"objects":["1094","1268","1337"]},"total_vertices":"3"},{"traverser_path":{"labels":[[],[],[]],"objects":["1094","1268","1337"]},"total_vertices":"3"},{"traverser_path":{"labels":[[],[],[]],"objects":["1094","1268","1337"]},"total_vertices":"3"},
{"traverser_path":{"labels":[[],[],[]],"objects":["1094","1268","1337"]},"total_vertices":"3"},{"traverser_path":{"labels":[[],[],[]],"objects":["1094","1268","1337"]},"total_vertices":"3"},{"traverser_path":{"labels":[[],[],[]],"objects":["1094","1268","1337"]},"total_vertices":"3"},{"traverser_path":{"labels":[[],[],[]],"objects":["1094","1268","1337"]},"total_vertices":"3"},
The path object has very useful information. We see that we only really share address 1268 in common. There are four paths because there were four different combinations of how 1094 or 1337 rated 1268. If you would like, you could confirm this for yourself by inspecting the edges along the paths. But we are going to move on to the next query.
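If you do want to run that check, one way (a sketch of ours, not one of the book's numbered examples) is to keep the rated edges in the traverser's path and project each edge's trust value. The by() modulators are applied round-robin across the path elements, so the vertices render as keys and the edges as trust ratings; the four rows should differ only in their edge values.

dev.V().has("Address","public_key","1094").
  bothE("rated").otherV().       // keep the edges in the traverser's path
  bothE("rated").otherV().
  has("Address","public_key","1337").
  path().
  by("public_key").              // applied to the vertices in the path
  by(values("trust"))            // applied to the edges in the path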
It has been helpful to find any path in our second neighborhood to 1337.
However, let’s start looking deeper into the data and consider only one direction: out(). Specifically, we want to know: how many outgoing paths to 1337 can we find in our third neighborhood? Further, let’s simplify this query by using the repeat().times(x) pattern to discover the paths in our third neighborhood in Example 8-12.
dev.V().has("Address","public_key","1094").     // start at 1094
  repeat(out("rated")).                         // walk out "rated" edges
  times(3).                                     // three times
  has("Address","public_key","1337").           // until you reach 1337
  path().                                       // get the path of each traverser
  by("public_key").as("traverser_path").        // for each path, get the vertex's key
  count(local).as("total_vertices").            // count the number of objects
  select("traverser_path","total_vertices")     // select the path and its length
The first three results are shown in Example 8-13.
{"traverser_path":{"labels":[[],[],[],[]],"objects":["1094","1268","35","1337"]},"total_vertices":"4"},{"traverser_path":{"labels":[[],[],[],[]],"objects":["1094","280","35","1337"]},"total_vertices":"4"},{"traverser_path":{"labels":[[],[],[],[]],"objects":["1094","1053","1268","1337"]},"total_vertices":"4"},...
{"traverser_path":{"labels":[[],[],[],[]],"objects":["1094","1268","35","1337"]},"total_vertices":"4"},{"traverser_path":{"labels":[[],[],[],[]],"objects":["1094","280","35","1337"]},"total_vertices":"4"},{"traverser_path":{"labels":[[],[],[],[]],"objects":["1094","1053","1268","1337"]},"total_vertices":"4"},...
We can see some interesting paths from 1094 to 1337 in Example 8-13. The third result shows that 1094 rated 1053, who rated 1268, who rated 1337. There are 11 total outgoing paths in our third neighborhood from 1094 to 1337.
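To double-check that total yourself, you can drop the path bookkeeping from Example 8-12 and simply count the traversers:

dev.V().has("Address","public_key","1094").
  repeat(out("rated")).times(3).
  has("Address","public_key","1337").
  count()                        // returns 11 for this dataset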
We can generalize the repeat().times(x) pattern to find paths of a known length. However, our overarching goal is to find paths of any length to eventually discover how to use Gremlin to discover shortest paths.
Pathfinding queries are for discovering the relationships that connect two things in your graph together. We want to discover both the quantity and the depth of the relationships that exist in the data.
That is, instead of querying for paths of a known length, we want to find paths of any length.
More often than not, we see engineers make the leap from paths of defined length to paths of unbounded length with queries like what we have in Example 8-14. As you see in Table 8-2, this is likely going to lead to an execution error.
1  dev.V().has("Address","public_key","1094").    // start at 1094
2    repeat(out("rated")).                        // walk out rated edges
3    until(has("Address","public_key","1337")).   // WARNING: this is all-paths!
4    path().
5    by("public_key").as("traverser_path").
6    count(local).as("total_vertices").
7    select("traverser_path","total_vertices")
If you ran Example 8-14 in DataStax Studio, you most likely saw the error shown in Table 8-2:
| System error |
|---|
| Request evaluation exceeded the configured threshold of realtime_evaluation_timeout at 30000 ms for the request |
Let’s walk through what is happening in Example 8-14 so that we can understand the error from Table 8-2. Line 1 in Example 8-14 accesses the starting address, 1094. Lines 2 and 3 apply the repeat().until() pattern. The repeat() step tells a traverser what it is supposed to do until the breaking condition from the until() step. We have just asked our traversers to keep searching for any path that starts at 1094 and ends at 1337. This is going to explore the entire connected graph for all paths that end at 1337. This is why we get the timeout error in Table 8-2.
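If you ever do need an open-ended search like this, a common safety valve (shown here as a sketch, not one of the book's numbered examples) is to cap the loop depth inside until() with loops(), so that traversers give up after a fixed number of hops:

dev.V().has("Address","public_key","1094").
  repeat(out("rated")).
  until(has("Address","public_key","1337").     // stop when we arrive at 1337...
        or().loops().is(gte(5))).               // ...or after 5 hops, whichever comes first
  has("Address","public_key","1337").           // discard traversers that only hit the depth cap
  path().by("public_key")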
For our problem we do not want all paths. We want to find the shortest path. Let’s connect some concepts together to try a different approach.
Recall our discussion from “Evaluation Strategies with the Gremlin Query Language”. We went through evaluation strategies, barrier steps, and thinking through how breadth-first or depth-first searching applies to traversals.
We told you we were going to apply those facts to find shortest paths. Let’s do that now.
We need to figure out whether our pathfinding traversal is using BFS or DFS. If it is using BFS, then we can guarantee that the first traverser that satisfies the stopping condition is the shortest path.
Thinking in Gremlin, traversals that are eagerly evaluated provide the behavior we need to guarantee BFS behavior. The key to figuring out if your traversal uses eager evaluation is to find out whether its execution strategy uses barrier steps.
Looking at our traversal from Example 8-14, the query did not use any of the barrier steps we talked about before. What are we missing?
The definitive way to answer this for yourself is to use the explain() step to see which traversal strategies are applied, as we have done in the query in Example 8-15.
g.V().has("Address","public_key","1094").
  repeat(out("rated")).
  until(has("Address","public_key","1337")).
  explain()

==> TraversalExplanation
Original Traversal
[GraphStep(vertex,[]), RepeatStep([VertexStep(OUT,vertex), RepeatEndStep], until(), emit(false))]
...
Final Traversal
[TinkerGraphStep(vertex,[]),
 VertexStep(OUT,vertex),
 NoOpBarrierStep(),      // Note: Barrier Execution Strategy
 VertexStep(OUT,vertex),
 NoOpBarrierStep(),      // Note: Barrier Execution Strategy
 VertexStep(OUT,vertex),
 NoOpBarrierStep()]      // Note: Barrier Execution Strategy
The explain() step prints out the traversal explanation for your traversal. A traversal explanation details how the traversal (prior to explain()) will be compiled given the registered traversal strategies.
Looking at Example 8-15, we see something very interesting: NoOpBarrierStep. The presence of NoOpBarrierStep in the traversal explanation informs us that the traversal engine injects barrier steps with the repeat() step.
We use the information from Example 8-15 to know that the repeat().until() pattern uses barriers. This means it executes eagerly using breadth-first search.
With one small change to Example 8-14, we can apply this knowledge in Example 8-16, which finds the single shortest path from 1094 to 1337.
1  dev.V().has("Address","public_key","1094").   // start at 1094
2    repeat(out("rated")).                       // walk out rated edges
3    until(has("Address","public_key","1337")).  // until 1337
4    limit(1).                                   // BFS: the first traverser is the shortest path
5    path().                                     // get the traverser's path information
6      by("public_key").as("traverser_path").    // get each vertex's public_key
7    count(local).as("total_vertices").          // count each path's length
8    select("traverser_path","total_vertices")   // select the path information
The important line to understand in Example 8-16 is line 4. The limit(1) step passes only one traverser into the remaining pipeline. Because the repeat().until() steps are eagerly evaluated, we can guarantee that the first traverser to satisfy the stopping condition is also the shortest path!
The path object for this traverser is:
{"traverser_path":{"labels":[[],[],[]],"objects":["1094","1268","1337"]},"total_vertices":"3"}
{"traverser_path":{"labels":[[],[],[]],"objects":["1094","1268","1337"]},"total_vertices":"3"}
This confirms what we already knew from the examples we did a while back: the shortest path from 1094 to 1337 is through 1268. We spent so much time setting up this example and walking through paths of fixed length so that when we got here, we could confirm that the path we found was indeed the shortest.
Zooming back out a bit, let’s think about how we would want to apply this information to answer this section’s main question. We have discovered you have one address in common: 1268. We also know that there are 11 ways we can find friends of friends that you have in common with 1337, which is the same as saying there are 11 paths of length 3 between you and 1337.
If you were really trying to make a decision about transacting with 1337, would you have enough information? Would you trust this address?
Maybe you want to understand the types of ratings that were given on these paths. Let’s look at three final queries to start to quantify trust from our edges onto our paths.
The next piece of information you likely want to consider is the trust ratings in the data along all of these paths. To look at those, we will want to reformat the data structure for our queries. Example 8-17 expands our shortest path query from Example 8-16 in two ways. First, it applies our knowledge of BFS and Gremlin query processing to find the top 15 shortest paths from 1094 to 1337. Then, it reformats the results using the project() step. Let’s take a look at the query and its results.
1  dev.V().has("Address","public_key","1094").
2    repeat(out("rated")).
3    until(has("Address","public_key","1337")).
4    limit(15).   // BFS: return the first 15 shortest paths by length
5    project("path_information","total_vertices").
6      by(path().by("public_key")).
7      by(path().count(local))
{"path_information":{"labels":[[],[],[]],"objects":["1094","1268","1337"]},"total_vertices":"3"},{"path_information":{"labels":[[],[],[],[]],"objects":["1094","280","35","1337"]},"total_vertices":"4"},{"path_information":{"labels":[[],[],[],[]],"objects":["1094","1268","35","1337"]},"total_vertices":"4"},...
{"path_information":{"labels":[[],[],[]],"objects":["1094","1268","1337"]},"total_vertices":"3"},{"path_information":{"labels":[[],[],[],[]],"objects":["1094","280","35","1337"]},"total_vertices":"4"},{"path_information":{"labels":[[],[],[],[]],"objects":["1094","1268","35","1337"]},"total_vertices":"4"},...
The main work of Example 8-17 is on lines 4 through 7. On line 4, we are taking only the first 15 traversers that reach the stopping condition from line 3. These first 15 traversers are guaranteed to be the 15 shortest paths because of how Gremlin uses barrier steps to process graph data in breadth-first order.
Then, starting on line 5 of Example 8-17, you see how we are going to format our results for the remaining queries in this chapter. We want to create a map with keys and values. The keys in the map will be path_information and total_vertices. The by() modulator on line 6 fills in the path_information key with a formatted version of the path() object, using the public_key of each visited vertex on the path from 1094 to 1337. The by() modulator on line 7 fills in the total_vertices key with the total number of vertices on that path.
Let’s add one more key to the map from Example 8-17. Let’s add up the trust values on the rated edges along the way and add this key/value pair to our results set. Adding up the trust values for each edge will represent the total trust of the path from 1094 to 1337.
As we walk through graph data, we need a way to aggregate information that we process along the way. The sack() step in Gremlin gives us this ability.
You can think of the sack() step as giving your Gremlin traverser a backpack at the start of its journey through the data. Along the way, you tell your traverser what to add or remove from the backpack (sack). This is very useful for collecting values on vertices or edges along the way and using them to make decisions or collect metrics.
For our paths, we want to add up the trust ratings from the edges. We will be giving our traverser an empty sack to start and then augmenting its contents with the trust ratings as it walks over edges. Figure 8-11 shows how this works conceptually.
Figure 8-11. A traverser walks the path from 1094 to 1337, storing the trust ratings from the edges in its sack along the way
In Figure 8-11, the traverser walks the shortest path from 1094 to 1337: a path of length 2 via the 1268 vertex. We show how we can use the sack() object to collect and aggregate the trust ratings from the edges during the traversal. The ending sack() value for this path is 10.
There is a second path a traversal can explore in Figure 8-11. This longer path that traverses through the 1053 vertex is shown in Figure 8-12.
The definitions for sack constructs in Gremlin are:
A traverser can contain a local data structure called a sack.
The sack() step is used to read and write sacks.
Each sack of each traverser is initialized when using withSack().
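To make these definitions concrete, here is a minimal sketch of the pattern on its own, assuming only a traversal source g over a graph whose rated edges carry a numeric trust property:

// Give each traverser a sack initialized to 0.0, add each edge's
// "trust" value into the sack while walking, then read the total.
g.withSack(0.0).
  V().
  outE("rated").
    sack(sum).
      by("trust").
  inV().
  sack()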
Figure 8-12. A traverser walks a different path from 1094 to 1337, storing the trust ratings from the edges in its sack along the way
Revisiting our query, we can calculate the total trust for our 15 shortest paths. We now know that we will use the sack() step to add up the trust ratings for each path. We also want to add this data as a new key in our result payload. The key total_trust will be our new key in the project() step. The value for total_trust will be the sum of the edge weights along the path using the sack() step.
Let’s see how we do this in Gremlin in Example 8-19.
1   dev.withSack(0.0).                            // initialize each traverser with a value of 0.0
2     V().
3     has("Address","public_key","1094").as("start").
4     repeat(outE("rated").                       // walk out and stop on the "rated" edge
5            sack(sum).                           // add to the traverser's sack
6              by("trust").                       // the value from the property "trust"
7            inV()).                              // leave the edge and walk to the incoming vertex
8     until(has("Address","public_key","1337")).  // repeat until 1337
9     limit(15).                                  // limit to the 15 shortest paths by length
10    project("path_information","vertices_plus_edges","total_trust").  // a map
11      by(path().by("public_key").by("trust")).  // first value: path information
12      by(path().count(local)).                  // second value: path's length
13      by(sack())                                // third value: path's trust score
Let’s step through Example 8-19. Line 1 shows how to initialize your traversal to use a local data structure for each traverser in your query: withSack(0.0). The next section to really dig into is lines 4 through 8. On line 4 and line 8, we see the expected repeat()/until() pattern for walking through our graph data for pathfinding with breadth-first search. Notice, however, that line 4 uses the outE() step. Using outE() ensures that each traverser stops on the edge between two vertices. It is necessary to stop on edges so we can collect the trust rating. Then on line 5, we tell the traverser to add something into its sack via sack(sum). You use by() modulators to tell the sack what you are adding into it. You find the by() modulator on line 6: by("trust"). The pattern of sack(sum).by("trust") tells the traverser to collect the trust property from its current object, which is an edge, and to add it to the value currently in its sack.
Then we tell the traverser to move to the incoming vertex with inV() on line 7. The stopping condition on line 8 asks a traverser to repeat this behavior until it reaches 1337. The first 15 traversers that meet this condition continue down into the project() step on line 10. Line 10 formats our results into a hashmap. The first key and value pair in the hashmap formats the path object to alternate between each vertex’s public key and each edge’s trust value. The second key and value pair in the hashmap counts the total number of objects in the path object. Because we visited edges along the way at line 4, we will have both vertices and edges in our path object. Therefore, we expect the total calculated on line 12 to be the sum of vertices and edges along the way.
Last, line 13 in Example 8-19 tells each traverser to report its sack’s value as we read its contents with sack(). The full table of results from Example 8-19 is shown in Example 8-20.
{"path_information":{"labels":[[],[],[],[],[]],"objects":["1094","9","1268","1","1337"]},"vertices_plus_edges":"5","total_trust":"10.0"},{"path_information":{"labels":[[],[],[],[],[],[],[]],"objects":["1094","4","1053","1","1268","1","1337"]},"vertices_plus_edges":"7","total_trust":"6.0"},{"path_information":{"labels":[[],[],[],[],[],[],[]],"objects":["1094","9","1268","1","35","9","1337"]},"vertices_plus_edges":"7","total_trust":"19.0"},...
{"path_information":{"labels":[[],[],[],[],[]],"objects":["1094","9","1268","1","1337"]},"vertices_plus_edges":"5","total_trust":"10.0"},{"path_information":{"labels":[[],[],[],[],[],[],[]],"objects":["1094","4","1053","1","1268","1","1337"]},"vertices_plus_edges":"7","total_trust":"6.0"},{"path_information":{"labels":[[],[],[],[],[],[],[]],"objects":["1094","9","1268","1","35","9","1337"]},"vertices_plus_edges":"7","total_trust":"19.0"},...
It would be most interesting to look at the paths with the most trust. Let’s add some sorting to the results from Example 8-19 to display our 15 shortest paths, sorted by their total trust. After we have our 15 shortest paths and before we format them, we just need to apply the sorting logic. You see this on lines 10 and 11 in Example 8-21.
1   dev.withSack(0.0).
2     V().
3     has("Address","public_key","1094").
4     repeat(outE("rated").
5            sack(sum).
6              by("trust").
7            inV()).
8     until(has("Address","public_key","1337")).
9     limit(15).
10    order().             // order all 15 paths
11      by(sack(), decr).  // according to each traverser's sack value, decreasing
12    project("path_information","vertices_plus_edges","total_trust").
13      by(path().by("public_key").by("trust")).
14      by(path().count(local)).
15      by(sack())
The sorting logic on lines 10 and 11 in Example 8-21 globally arranges, in decreasing order, the 15 traversers in the pipeline according to the value within each traverser’s sack. The first result is shown in Example 8-22.
{"path_information":{"labels":[[],[],[],[],[],[],[],[],[]],"objects":["1094","9","1268","10","1094","9","1268","1","1337"]},"vertices_plus_edges":"9","total_trust":"29.0"},...
{"path_information":{"labels":[[],[],[],[],[],[],[],[],[]],"objects":["1094","9","1268","10","1094","9","1268","1","1337"]},"vertices_plus_edges":"9","total_trust":"29.0"},...
Did you notice something unexpected in Example 8-22? The highest weighted path has two cycles between 1094 and 1268. This type of path wouldn’t make sense in our application because we are considering the ratings between two keys more than once.
We introduced and used simplePath() in Chapter 6 to remove cycles; let’s add that step here. Adding simplePath() adjusts our final query to find the 15 shortest paths without cycles and then sorts the 15 paths by their aggregated trust score, in descending order. Example 8-23 shows our final query, and Example 8-24 displays the results.
1   dev.withSack(0.0).
2     V().
3     has("Address","public_key","1094").
4     repeat(outE("rated").
5            sack(sum).
6              by("trust").
7            inV().
8            simplePath()).  // remove a traverser if there is a cycle in its path
9     until(has("Address","public_key","1337")).
10    limit(15).
11    order().
12      by(sack(), decr).
13    project("path_information","vertices_plus_edges","total_trust").
14      by(path().by("public_key").by("trust")).
15      by(path().count(local)).
16      by(sack())
Example 8-24 shows the top 15 shortest paths without cycles, sorted by their aggregated trust score.
{"path_information":{"labels":[[],[],[],[],[],[],[],[],[]],"objects":["1094","10","64","10","104","3","35","9","1337"]},"vertices_plus_edges":"9","total_trust":"32.0"},{"path_information":{"labels":[[],[],[],[],[],[],[],[],[]],"objects":["1094","9","1268","2","1201","5","35","9","1337"]},"vertices_plus_edges":"9","total_trust":"25.0"},{"path_information":{"labels":[[],[],[],[],[],[],[],[],[]],"objects":["1094","3","280","8","35","9","1337"]},"vertices_plus_edges":"7","total_trust":"20.0"}...
{"path_information":{"labels":[[],[],[],[],[],[],[],[],[]],"objects":["1094","10","64","10","104","3","35","9","1337"]},"vertices_plus_edges":"9","total_trust":"32.0"},{"path_information":{"labels":[[],[],[],[],[],[],[],[],[]],"objects":["1094","9","1268","2","1201","5","35","9","1337"]},"vertices_plus_edges":"9","total_trust":"25.0"},{"path_information":{"labels":[[],[],[],[],[],[],[],[],[]],"objects":["1094","3","280","8","35","9","1337"]},"vertices_plus_edges":"7","total_trust":"20.0"}...
The query in Example 8-23 brings together all of the path concepts we set out to explore in development. We use our knowledge of breadth-first search in Gremlin to find the 15 shortest paths and augment our results with each path’s weight. The top three highest weighted paths are shown in Example 8-24. Our results show that the longer the path, the higher its weight.
Are the results in Example 8-24 what you would want to use for determining whether you trust address 1337?
You are likely shouting “No!” The results in Example 8-24 show that the paths with the most trust value are also the longest paths. Longer walks through our data will aggregate more trust ratings along the way and therefore are “more trusted.”
The structure of our data and pathfinding queries in development are not returning results that make sense for an application.
You may see different results from Example 8-24 in your Studio Notebook. This is because the top 15 paths include three paths of length 4 (nine total objects: five vertices, four edges). There are more than three paths of length 4, and the results will include the first three that are discovered.
Our exploration in development has left us with two optimizations we need to address for a production-quality query. First, we need a different way to understand and use weights in making our decision about trust. The way that trust is represented in the dataset now is not providing results that are meaningful to a user of this data.
The second optimization we need is to find paths that are both short and with high trust. In development, we discovered that our tools can find either shortest paths by length or all paths. And it is too expensive to find all paths. We need a different approach for shortest weighted paths for our production-quality queries.
We need to normalize the edge weights on this data so that we can properly find shortest weighted paths. That is the theme and objective of the next chapter.
We opened this chapter with an idea: that humans naturally treat the distance between concepts as positively correlated with how much we trust the association between those concepts.
To quantify our idea, we defined the shortest path problem, walked through the fundamentals of searching through graph data, and applied those concepts with the Gremlin query language. Then our development examples showed how using paths to quantify trust in a network informs a decision about transacting on the Bitcoin OTC network.
However, we realized that we cannot simply add up trust scores as a measure of trust in this network to quantify our most valuable paths. To discover the most trusted paths in our data, we need to introduce two concepts for production use of pathfinding: normalization and query optimizations.
Continue with us to the next chapter to learn how teams commonly evolve their thinking to address a more complex problem in production: shortest weighted paths in graph data.
1 Kelvin Lawrence, Practical Gremlin: An Apache TinkerPop Tutorial, January 6, 2020, https://github.com/krlawrence/graph.
2 Kumar, Srijan, et al. “Edge Weight Prediction in Weighted Signed Networks,” in 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, December 12–15, 2016 (Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2017), 221–30.
3 Srijan Kumar, Bryan Hooi, Disha Makhija, Mohit Kumar, Christos Faloutsos, and V.S. Subrahmanian, “REV2: Fraudulent User Prediction in Rating Platforms,” in WSDM ’18: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Marina del Rey, California, February 5–9, 2018 (New York: ACM, 2018), 333–41.
4 See the DataStax Bulk Loader Documentation at https://docs.datastax.com/en/dsbulk/doc/dsbulk/reference/schemaOptions.html#schemaOptions__schemaMapping.
More often than not, the first concept we think about with paths is how many stops it takes to get from the start to the finish. This was the topic for Chapter 8.
The next concept when working with paths through graphs is to evolve the idea of distance. We do this by adding some type of weight or cost to steps along a path. We refer to this type of problem as a minimum cost path or a shortest weighted path.
Shortest weighted paths are very popular optimization problems in computer science and mathematics. These types of problems tend to be multifaceted, complex optimization problems because they are trying to combine more than one source of information into a cost metric for minimization.
We saw an example of a weighted path problem at the end of Chapter 8. We tried to find the most trusted path through our data by aggregating path weights. Because high trust in our sample data is represented by higher values, this type of pathfinding problem led to the discovery that higher trust paths are also longer paths through our data. This is not what we wanted.
Instead, we need to understand how to use edge weights to find shortest paths. Through the lenses of mathematics and computer science, we want to create a bounded minimum optimization problem.
In this sense, high trust is inversely correlated with path length. We want to find paths that are simultaneously short and have high trust. This is the difficult duality we are going to address and optimize in this chapter.
There are three main sections in this chapter.
In the first section, we are going to formally define the shortest weighted path problem and walk through the algorithm. Our pathfinding algorithm uses breadth-first search, optimized to find shortest weighted paths.
The second section introduces the edge weight normalization process. We will walk through the general process of shifting and flipping our weights’ scale from “higher is better” to “lower is better.” We will show the new weights we calculated for our sample dataset, create a new edge, and reload the normalized trust scores for our example.
The last section uses the A* algorithm on our normalized data. We will break down writing A* in the Gremlin query language and run it on our example data to find the shortest weighted paths between your public key 1094 and your open invite with 1337.
Though your journey through this book has been long, we hope you have high trust in our upcoming examples. See? You are already correlating longer paths with higher trust.
We have already tried to use edge weights in our pathfinding problems. We did this at the end of Chapter 8 when we introduced the sack() step to aggregate trust ratings across paths in the Bitcoin OTC trust network.
However, our process was inefficient because the tools we had did not solve the problem we thought we were trying to solve. This section addresses two reasons our first attempt didn’t work by teaching two new tools.
First, we will define the problem for shortest weighted paths and look at a few correct examples. Then, we will introduce a new algorithm for finding solutions to shortest weighted path problems, the A* search algorithm. You will see these tools later when we build the A* search algorithm in Gremlin to find shortest weighted paths in the normalized Bitcoin OTC network.
Let’s get started with a new problem definition.
Recall that in Chapter 8 we defined shortest paths. As a refresher, the shortest path in a graph is the fewest number of edges it takes to walk from one vertex to another in the graph.
A weighted path uses properties from your graph data to aggregate and score a path’s weighted distance from start to end. The shortest weighted path is the path with the lowest score:
The shortest weighted path discovers the path between two vertices in a graph such that the total sum of the edges’ weights is the minimum.
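In symbols: for a start vertex u, an end vertex v, and a weight w(e) on each edge e, we are looking for

shortest_weighted_path(u, v) = minimum, over all paths P from u to v, of the sum of w(e) for every edge e in P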
Let’s use a concrete example; Figure 9-1 adds some weights to our example graph from Figure 8-4.
Figure 9-1. The shortest weighted path from A to D
Figure 9-1 uses bolded edges to illustrate the shortest weighted path from vertex A to vertex D.
The total weight of the shortest weighted path from A to D is 6. Contrast this weight with the shortest path in the graph. The shortest path in the graph is A → D and has a weight of 10. This path is not the shortest weighted path because A → B → C → D has a lower weight of 6.
With a new problem come new approaches. In the small example in Figure 9-1, we can quickly see the shortest path.
For larger graphs, we need to fold in multiple optimizations. The only optimization we have so far from BFS and DFS tracks the visited set of vertices so that we do not repeat exploring the same space.
But we can be smarter when we are working with weighted graphs. Let’s delve into shortest weighted paths in graph data.
A quick Google search on “graph path algorithms” returns an extensive list, including A* (pronounced “A star”), Floyd-Warshall, and Dijkstra’s, to name a few. We are zooming in on the optimizations that these algorithms apply to teach you the fundamentals that apply to any approach. The costs and benefits of different searching algorithms come from understanding how they reduce the search space with different creative optimizations.
Different algorithms solve the shortest weighted path problem for graphs by applying a few optimizations along the way. A graph search algorithm maintains a tree of paths from the starting vertex and applies heuristics to decide whether a new edge should be added into the working tree. At a high level, some of those optimizations include:
The lowest cost optimization excludes an edge if the edge’s destination is reachable via a lower cost path.
Supernode avoidance excludes a vertex if its degree would increase the search space complexity over a threshold.
A global heuristic excludes an edge if the edge’s weight causes the path’s total weight to exceed a threshold.
There are a myriad of heuristics you can apply to optimize your graph algorithm. Choosing good heuristics requires understanding your data, its distributions, and the graph structures you want to avoid during pathfinding.
The second optimization defined here notes that you could optimize your search space by eliminating supernodes. Let’s take a brief side tour to define supernodes and explain why you would want to use a heuristic to remove them from your search space.
The idea of a supernode is that it is a vertex with an extremely high number of edges. That is where the idea of super comes from; a supernode is a highly connected vertex in your graph data.
A supernode is a vertex with a disproportionately high degree.
For a direct example, think about Twitter’s social network. You want to mentally draw out a graph of Twitter accounts in which the edges are who follows whom. A supernode is a vertex with a very high number of followers compared to the rest of the network. Most celebrities on Twitter are good examples of supernodes.
Fun fact: in the early days of building Apache Cassandra, the team developed counters to track the number of followers for a Twitter account. This was known as the Ashton Kutcher problem, as he was the first to reach 1,000,000 followers on Twitter. The volume of followers makes Ashton Kutcher’s account a supernode in the Twitter network.
As it relates to pathfinding, if you traverse into a supernode, you potentially add millions of edges to your priority queue. This will blow up the computational cost of your traversal due to the many new edges to consider for pathfinding.
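Before worrying about supernodes in a pathfinding query, it helps to check whether your graph has any. The following is a hedged sketch for ranking vertices by outgoing degree, assuming only a traversal source g; the limit of 10 is illustrative, not a prescribed threshold:

// Rank vertices by outgoing degree to spot potential supernodes.
g.V().
  project("vertex", "out_degree").
    by(id()).
    by(outE().count()).
  order().
    by(select("out_degree"), decr).
  limit(10)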
To this end, let’s walk through some theoretical limitations with supernodes.
In Apache Cassandra, a partition can contain, at most, two billion cells. An edge table in DataStax Graph requires the primary keys for each endpoint vertex, therefore requiring a minimum of two cells per edge. But to get unique edges, you need some type of universally unique identifier (UUID) on the edge. Thus, the minimum number of cells per edge is three, so a single partition tops out at two billion divided by three edges when you reach the uppermost limit of storing a supernode on disk.
That means that in DataStax Graph, a single vertex with 666,666,666 edges is one edge away from hitting the limit on disk for the number of cells in an Apache Cassandra table. That’s ominous.
Regardless, you will hit a snag with processing supernodes in a traversal well before you create one on disk. To see this, think back to the processing limitations we discovered in Chapter 6. We ran into processing limitations due to our graph’s branching factor for relatively low degree vertices. It is safe to say that you are likely to be troubleshooting the processing performance of supernodes well before you reach limitations on disk.
Our approach with supernodes in our upcoming implementation will be to eliminate them entirely. Let’s outline how we will apply this technique, and a few more optimizations, in the next section.
Let’s first understand the pseudocode approach for the algorithm we are going to build. We will implement a BFS algorithm in Gremlin with optimizations specific to our dataset in a future section.
We are going to apply three optimizations to pathfinding in our weighted Bitcoin OTC network:
Lowest cost optimization excludes the edge if we have already found a shorter path to the next vertex.
Supernode avoidance excludes an edge if the destination vertex has too many outgoing edges.
Global heuristic excludes the edge if the edge’s weight causes the path’s total weight to exceed the maximum value we want to consider.
The pseudocode in Example 9-1 describes the algorithm that we will be implementing in this chapter.
ShortestWeightedPath(G, start, end, h)
Use sack to initialize the path distance to 0.0
Find your starting vertex v1
Repeat
    Move to outgoing edges
    Increment the sack value by the edge weight
    Move from edges to incoming vertices
    Remove the path if it is a cycle
    Create a map; the keys are vertices, value is the minimum distance
    O1 Remove a traverser if its path is longer than the min path to the current v
    O2 Remove a traverser if it walked into a supernode with 100+ outgoing edges
    O3 Remove a traverser if its distance is greater than a global heuristic
Check if the path reached v2
Sort the paths by their total distance value
Allow the first x paths to continue
Shape the result
Let’s walk through the process we described in Example 9-1 because we will be implementing it in Gremlin in this chapter. Our approach starts every traverser with a distance of 0.0 on the starting vertex. Then a looping condition moves a traverser onto an edge, updates the traverser’s total distance, and applies a series of filters to determine whether the traverser should continue exploring the graph. We continue that looping process until we find x number of paths that satisfy all of the optimizations and filters.
The series of filters we are referring to are labeled with O_n_ in Example 9-1 and apply each of the three optimizations we just outlined. The line labeled O1 in Example 9-1 shows how we will apply the lowest cost optimization; a traverser will be removed if we have already found a shorter path to its location. The line labeled O2 applies the supernode avoidance optimization, which filters out supernodes from our pathfinding algorithm by setting a hard limit on a vertex’s degree. Last, the line labeled O3 applies a global heuristic that removes a traverser if its path reaches too high of a weight, because such paths (probably) do not mean anything to your application.
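To preview how two of these filters can look in Gremlin, here is a hedged sketch of the looping condition with O2 and O3 folded in. It assumes the norm_trust property we introduce later in this chapter, the thresholds (100 outgoing edges, a maximum distance of 10.0) are illustrative, and O1’s minimum-distance bookkeeping is omitted; treat this as a sketch of the idea, not the final implementation:

g.withSack(0.0).
  V().has("Address","public_key","1094").
  repeat(outE("rated").
         sack(sum).
           by("norm_trust").
         inV().
         simplePath().                              // remove cycles
         where(outE("rated").count().is(lt(100))).  // O2: supernode avoidance
         where(sack().is(lt(10.0)))).               // O3: global heuristic
  until(has("Address","public_key","1337")).
  limit(15)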
We are almost ready to build up the Gremlin statements that implement Example 9-1. To help get us there, let’s talk about how to address the edge weight problem from the end of Chapter 8.
The way the dataset quantifies trust was the biggest hurdle we found during Chapter 8. With the way the data is now, we do not have a way to use the edge weights to find shortest weighted paths because the most trusted paths would be the longest ones.
We need to transform the edge weights to use them to find shortest weighted paths.
The upcoming transformation does two things. First, it applies logarithms so that we can meaningfully add weights to find maximum trust paths. Second, we have to flip the scale so that a minimum weighted path correlates to maximum trust.
This section walks through how to do this transformation. Then we will update our dataset and graph. Last, we will look at a few paths in the data and show how to meaningfully interpret the new edge weights.
There are three steps to the data transformation process:
Shift the scale to the interval [0,1].
Frame the new scale as a shortest path problem.
Decide how to handle modeling infinity.
Let’s walk through all three of these steps and why we need to do them. We will show you how the scale transforms the weights at the very end.
The trust interval in the original dataset ranges from –10 to 10, where –10 represents no trust and 10 represents absolute trust. Figure 9-2 shows the distribution of observations in the dataset using Gremlin in DataStax Studio.
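If you would like to reproduce a distribution like Figure 9-2 yourself, a simple groupCount over the edges does the job; this sketch assumes the Chapter 8 schema, where the rated edges still carry the original trust property:

// Count how many rated edges exist for each trust value.
g.E().
  hasLabel("rated").
  groupCount().
    by("trust")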
Our objective is to map the trust scores from the interval [-10,10] onto the scale [0,1]. Mapping onto [0,1] gives us a way to create a confidence type score such that multiplying two scores gives us a mathematically sound way to model the aggregation of trust. This technique is similar to how we mathematically reason about probabilities.
In other words, mixing negative and positive scores together doesn’t describe how we can mathematically reason about user ratings; we need a more consistent scale.
Additionally, we see in Figure 9-2 that there are no ratings with a trust value of 0. Therefore, we have decided that the rating of 1 will designate being “on the fence.” And, we will remove “0” from the mapping. This mapping gives us the following starting points for our shifted scale:
A rating of –10 maps to 0 to mean no trust.
1 maps to 0.5 to mean “on the fence.”
10 maps to 1 to mean maximum trust.
We will fill in the rest of the ratings linearly into those intervals. The linear transformation creates increments of 0.05 between –10 and 1 and increments of 0.05556 from 2 to 10. We calculated these increments via:
range / total_numbers = 0.5 / 10 = 0.05
range / total_numbers = 0.5 / 9 = 0.05556
The full table of mappings is coming up in Figure 9-4.
We can’t yet use the values between 0 and 1 to calculate shortest paths on the edges because higher scores still correlate to high trust. We will run into the same problem as in Chapter 8: longer paths have higher trust. To get to where we need to be, we have to discuss two more mathematical transformations.
We are essentially trying to find the highest trust path between two addresses in our data. To frame that as a shortest path problem, we have to do two things:
Use logarithms so that multiplication becomes addition.
Multiply the result by –1, so that the maximum becomes a minimum.
The first step here is an important transformation to understand so that you can accurately model certain phenomena in data, such as trust, so let’s walk through it.
In many cases, using logarithms for edge weights isn’t necessary because you can simply add up the weights, as in the logistics example.
However, in some cases, you need to multiply instead of add. This is true when you are dealing with probabilities, confidence values, and so on.
Trust is essentially a confidence value. Mathematically, this means that “your trust of someone else’s trust” multiplies those two concepts, rather than adding them together.
Let’s think about it. If you half trust person A, and person A half trusts person B, do you conclude that you fully trust person B? No, you probably don’t. Instead, you conclude that you somewhat distrust person B.
How you are reasoning about this is the difference between adding trust scores and multiplying them. If you decided that you fully trusted person B, you would be adding half of your trust and half of person A’s trust to reach your conclusion. Logically, this doesn’t make sense, because we are dealing with your confidence in someone else’s opinion. When you reason that you somewhat distrust person B, you are (essentially) multiplying your half trust by person A’s half trust to arrive somewhere around “0.25” total trust.
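The arithmetic behind that intuition also shows why logarithms are about to matter: multiplying confidences is the same as adding their logs.

0.5 * 0.5 = 0.25
log(0.5) + log(0.5) = (–0.301) + (–0.301) = –0.602
10^(–0.602) ≈ 0.25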
To represent this numerically, we have to apply a logarithmic transformation to use the values between 0 and 1. Using logarithms allows us to add the trust scores together instead of multiplying them. This transformation gives us the following values for our scores:
–10 maps to 0; log(0) = negative infinity
1 maps to 0.5; log(0.5) = -0.301
10 maps to 1; log(1) = 0
The second half of step 2 indicates that the final transformation is to multiply these scores by –1. This last step is required so that the maximum becomes a minimum; we need minimums for finding shortest weighted paths.
To illustrate this mapping, Figure 9-3 plots our transformation. On the y-axis, 0 means distrust and 1 means trust. The x-axis plots the transformed scores, showing how higher transformed scores correlate to lower trust.
The point to consider is the point shown in Figure 9-3.
The point at which a trust score flips from trust to distrust is 0.30103. A score of less than 0.30103 represents trust, whereas a value greater than 0.30103 represents distrust.
The transformation of high trust to low scores gives us the ability to add scores together such that lower total scores mean higher total trust. Being able to find lowest scores gives us an optimization to find the smallest total weight. From here, we can apply these new weights to reason about shortest weighted paths in our application.
There is one last decision required: how to represent (–1)*log(0) = infinity in our data.
There are a few decisions your team has to weigh about how to represent (–1)*log(0) = infinity in your data. You want to select a value large enough so that a path with this value has little chance of being a shorter weighted path, but not so large that its value is worse than no edge at all.
We selected the value 100 to represent the score of (–1)*log(0). Let’s think about why this is a decent choice. Consider arbitrary endpoint vertices a and b with an edge weight of 100. The weighted path between a and b is almost guaranteed to be longer than any other path in the graph. You would have to find a path of 101 edges, with each edge having a weight of 1, for the direct path between a and b to be a shorter weighted path. In the context of our problem, a path of length 101 doesn’t really make sense to our application. As a result, we feel the choice of 100 for our example is good enough.
The values shown in Figure 9-4 detail the steps we just discussed in the past few sections. We first shifted [–10, 10] to [0,1]. Then, we took the logarithm of each value and multiplied the result by –1. The final scores set (–1)*log(0) to 100.
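To tie the three steps together, here is a hedged Groovy sketch of the whole transformation; normTrust is a hypothetical helper name, and the loading scripts in this book’s GitHub repository remain the authoritative version:

// Map a raw rating in [-10,-1] or [1,10] to its normalized trust distance.
double normTrust(int rating) {
    // Step 1: shift the rating onto [0,1]; there is no 0 rating, and 1 maps to 0.5
    double shifted
    if (rating <= -1) {
        shifted = (rating + 10) * 0.05               // -10..-1 maps to 0.00..0.45
    } else if (rating == 1) {
        shifted = 0.5                                // "on the fence"
    } else {
        shifted = 0.5 + (rating - 1) * (0.5 / 9.0)   // 2..10 maps up in steps of ~0.05556
    }
    // Step 3: (-1)*log(0) is infinite, so represent no trust with 100
    if (shifted == 0.0d) {
        return 100.0
    }
    // Step 2: take the logarithm and flip the sign so maximum trust becomes a minimum
    return -1 * Math.log10(shifted)
}

As a quick spot check against the values we will see in the data: normTrust(9) returns 0.0248, normTrust(1) returns 0.30103, and normTrust(-10) returns 100.0.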
Next, we need to update our graph’s schema and load the transformed version of the edges so that we can use these new weights.
We want to augment our current rated edge to have the new normalized values. We will do that by adding a property called norm_trust onto the rated edge as the clustering key. Figure 9-5 shows the new graph model and indicates that we are making the new property the clustering key for the edges. Indicating that norm_trust is an edge’s clustering key will sort the rated edges on disk in increasing order.
The schema code for Figure 9-5 is in Example 9-2. We hope you are learning how to translate graph data models to schema code with the Graph Schema Language (GSL), just like using an ERD to create tables.
schema.vertexLabel("Address").
  ifNotExists().
  partitionBy("public_key", Text).
  create();

schema.edgeLabel("rated").
  ifNotExists().
  from("Address").
  to("Address").
  clusterBy("norm_trust", Double, Asc).
  property("datetime", Text).
  create()
As we did in Chapter 8, we are going to load the data into our graph using the DataStax Bulk Loader, a command-line tool. The datasets that accompany this text already have the transformation of the edge weights. If you would like to see the code, please head to the Chapter 9 data directory within this book’s GitHub repository for the data and loading scripts for these examples.
Let’s do some basic exploratory queries to ensure that we understand our data and that it loaded correctly.
Before we get into implementing shortest weighted paths, let’s look at our same queries from Chapter 8. This time, however, we want to use the norm_trust property as we explore paths between 1094 and 1337.
The two queries we will do in this section are:
Find all paths of length 2, sorted by total trust
Find the 15 shortest paths by path length, sorted by total trust
Let’s start with the first query.
In Example 9-3, we are revisiting the same path of length 2 from Chapter 8 but are calculating the trust distance using the normalized weights.
1   g.withSack(0.0).
2     V().has("Address","public_key","1094").
3     repeat(outE("rated").
4            sack(sum).
5              by("norm_trust").
6            inV()).
7     times(2).
8     has("Address","public_key","1337").
9     order().
10      by(sack(), asc).
11    project("path_information","total_elements","trust_distance").
12      by(path().by("public_key").by("norm_trust")).
13      by(path().count(local)).
14      by(sack())
The raw results of Example 9-3 are shown in Example 9-4:
{"path_information":{"labels":[[],[],[],[],[]],"objects":["1094","0.0248","1268","0.30103","1337"]},"total_elements":"5","trust_distance":"0.32583"}
{"path_information":{"labels":[[],[],[],[],[]],"objects":["1094","0.0248","1268","0.30103","1337"]},"total_elements":"5","trust_distance":"0.32583"}
As we found during Chapter 8, there is only one path of length 2 between our start and end vertices. The path object from Example 9-4, combined with the weights we found in Chapter 8, is illustrated in Figure 9-6.
Figure 9-6. The normalized edge weights on the only path of length 2 in our data between 1094 and 1337
The total trust distance for the path illustrated in Figure 9-6 is 0.32583. You can reverse this score to understand how it would fit into the shifted [0, 1] scale. To do that, you multiply the final score by –1 and then raise 10 to the power of the result: 10^(–1*(0.0248 + 0.3010)) = 0.4723.
This means that the weighted trust of this path on a scale of [0,1] is 0.4723. Thus, we slightly distrust this path because 0 means distrust and 1 means trust. This path’s weighted trust score is slightly less than 0.5 and is therefore slightly distrusted.
You may be wondering: but what about other paths? So, let’s look at our second query from Chapter 8.
For a quick refresher, remember that the queries we built up in Chapter 8 combined our knowledge of barriers in Gremlin and the logic of breadth-first search. The queries applied these concepts to guarantee shortest paths by path length, not by weight.
We apply the shortest path logic in Example 9-5 to find the 15 shortest paths by length, but then order those paths by their normalized trust distance.
Let’s see the query in Example 9-5.
1   g.withSack(0.0).                              // init each traverser to have a value of 0.0
2     V().has("Address","public_key","1094").     // start at 1094
3     repeat(                                     // repeat
4       outE("rated").                            // walk out to an edge and stop
5       sack(sum).                                // aggregate into the traverser's sack
6         by("norm_trust").                       // the value on the edge's property: "norm_trust"
7       inV().                                    // move and walk into the next vertex
8       simplePath()).                            // remove the traverser if it has a cycle
9     until(has("Address","public_key","1337")).  // until you reach 1337
10    limit(15).                                  // BFS: first 15 are the 15 shortest paths, by length
11    order().                                    // sort the 15 paths
12      by(sack(), asc).                          // by their aggregated trust scores
13    project("path_information","total_elements","trust_distance").  // make a map
14      by(path().by("public_key").by("norm_trust")).  // first value: path information
15      by(path().count(local)).                  // second value: length
16      by(sack())                                // third value: trust
Example 9-6 displays the results of Example 9-5, and our top three most trusted paths show very interesting results. In Chapter 8, we found the shortest path: 1094 → 1268 → 1337. Example 9-6 shows that this path is the second most trusted path of the 15 shortest paths, which means we can conclude that there is a longer path that is also more trusted.
{"path_information": {"labels": [[],[],[],[],[],[],[]],
                      "objects": ["1094","0.2139","280","0.0512","35","0.0248","1337"]},
 "total_elements": "7",
 "trust_distance": "0.2899"},
...,
{"path_information": {"labels": [[],[],[],[],[]],
                      "objects": ["1094","0.0248","1268","0.30103","1337"]},
 "total_elements": "5",
 "trust_distance": "0.32583"},
{"path_information": {"labels": [[],[],[],[],[],[],[]],
                      "objects": ["1094","0.0248","1268","0.30103","35","0.0248","1337"]},
 "total_elements": "7",
 "trust_distance": "0.35063"},
...
Example 9-6 displays results that we couldn't find in Chapter 8: a path that is longer and more trusted. The most trusted of the 15 shortest paths is a path of length 3, 1094 → 280 → 35 → 1337, with a total trust distance of 0.2899.
The results in Example 9-6 are the shortest paths by length, sorted by their trust distance. This is not the same as the shortest weighted paths, which we have not computed yet.
It is exciting to have found a longer path with a better trust score. However, what does the value 0.2899 mean? How much do we trust this path?
The weights in our graph represent a normalized trust distance. This guarantees that the shortest weighted path is the most trusted path.
Ultimately, you want to say, “Do I trust this path or not?” To answer that, you have to convert the path’s final weight back to the shifted scale from Figure 9-4. You have to convert the path’s total trust distance to make a statement about whether you trust or distrust this path.
Let's examine in detail how a path's trust distance maps back to our trust scale of [0,1].
The best possible shortest path has a weight of zero; this happens when all edges in the path have a normalized weight of 0. A trust distance of 0 converts to the highest possible trust score of 1: 10^(–0) = 1.
For all trust distances d, the conversion formula is shown in Figure 9-7 and Figure 9-3.
Figure 9-7. The formula for converting a path's distance to the [0,1] trust scale, for all trust distances d: trust = 10^(–d)
Let's convert the total weight of all three results from Example 9-6. Each path and its conversion is listed next; a quick verification sketch follows the list:
Top path with seven objects: 10^(–0.28990) = 0.5130
Shortest path with five objects: 10^(–0.32583) = 0.4722
Third path with seven objects: 10^(–0.35063) = 0.4460
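If you want to verify these conversions yourself, here is a minimal Gremlin Console (Groovy) sketch; the three distances are hard-coded from Example 9-6, and the helper name toTrustScore is ours, not from the text.

    // Convert a total trust distance d back to the [0,1] trust scale: trust = 10^(-d)
    def toTrustScore = { double d -> Math.pow(10, -d) }

    [0.28990, 0.32583, 0.35063].each { d ->
        printf("trust distance %.5f -> trust score %.4f%n", d, toTrustScore(d))
    }
    // prints trust scores of roughly 0.513, 0.472, and 0.446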
The three converted scores above represent each path's total trust on the scale [0,1], where 0 means distrust and 1 means trust. This means that of our 15 shortest paths by length, we found one path that we slightly trust. The top result from Example 9-6 has an aggregated normalized weight of 0.28990, which converts to a trust score of 0.5130 on our [0,1] scale. Therefore, we slightly trust this path.
The examples so far helped us understand how to reason about the normalized trust scores in our paths.
However, is the first result from Example 9-6 the most trusted path in our data? To find out, we need to apply some optimizations to our query to find the single shortest weighted path.
The data we are using for this example aims to show you how to find the most trusted path between two addresses. The most trusted paths in this data are more than just the shortest paths.
We want to find the paths through our example graph whose edges carry the highest trust values.
To get there, we had to convert the edge weights so that they can be used to solve the shortest weighted path problem. The conversion process did two things: (1) it used logarithms so that we can meaningfully add weights along the path, and (2) it flipped the scale so that a minimum weighted path correlates to maximum trust.
We are reiterating this process because these are the common tools teams reach for when they need to use weighted edges in shortest path problems. Reshaping data to solve complex problems illustrates the powerful creativity within the intersection of data science and graph applications.
Using this knowledge, let’s move on to developing Gremlin queries that calculate shortest weighted paths.
The algorithmic process we have been using up to this point is shown in Example 9-7.
A  Use sack to initialize the path distance to 0.0
B  Find your starting vertex v1
C  Repeat
D    Move to outgoing edges
E    Increment the sack value by the edge weight
F    Move from edges to incoming vertices
G    Remove the path if it is a cycle
H  Check if the path reached v2
I  Allow the first 20 paths to continue
J  Sort the paths by their total distance value
K  Shape the result
Recall that barrier steps in Gremlin, like the repeat().until() pattern, process the data like breadth-first search. This means that step I in Example 9-7 guarantees shortest paths by length.
In Example 9-8, those algorithmic steps are shown next to the corresponding lines of the query we just ran.
A  g.withSack(0.0).
B    V().has("Address","public_key","1094").
C    repeat(
D      outE("rated").
E      sack(sum).by("norm_trust").
F      inV().
G      simplePath()).
H    until(has("Address","public_key","1337")).
I    limit(20).
J    order().by(sack(),asc).
K    project("path_information","total_elements","trust_distance").
       by(path().by("public_key").by("norm_trust")).
       by(path().count(local)).
       by(sack())
We are going to use this pattern of pseudocode (as in Example 9-7) mapped to Gremlin steps (as in Example 9-8) to build up our shortest weighted path query. We will add to the query in Example 9-8 to change and then optimize it into a shortest weighted path query.
The process we want to build toward implements the optimizations we introduced in “Shortest Weighted Path Search Optimizations”: lowest cost optimization, global heuristics, and supernode avoidance. There are four steps to doing this so that we can create a production-quality query for our application:
Swap two steps and change our limit
Add an object to track the shortest weighted path to a visited vertex
Remove a traverser if its path is longer than one already discovered to that vertex
Remove traversers for custom reasons, such as to avoid supernodes
Let’s build up the Gremlin query by incrementally adding steps through each of these four procedures.
We went through a reminder of where we are because the first step in building our shortest path query is very similar to Example 9-8. We need only to swap the order of steps and limit to one result to translate Example 9-8 to a single shortest weighted path query.
The algorithm in Example 9-9 swaps step J and step I from Example 9-7. This swap changes our process from shortest paths to shortest weighted paths. Then we change the limit from 20 to 1 so that we are finding the single shortest weighted path.
We are going to start labeling these new optimizations with O_stepNumber, where the step number comes from the list in this section. You will find asterisks (*) in the pseudocode and query to indicate the new lines we are adding to our pathfinding traversal. We find this makes it easier to logically map the optimizations in the pseudocode to the statements in the Gremlin query.
A    Use sack to initialize the path distance to 0.0
B    Find your starting vertex v1
C    Repeat
D      Move to outgoing edges
E      Increment the sack value by the edge weight
F      Move from edges to incoming vertices
G      Remove the path if it is a cycle
H    Check if the path reached v2
O1*  Sort the paths by their total distance value
O1*  Allow the first path to continue, this is the shortest path by weight
K    Shape the result
If it is that easy, why can’t we just stop here?
We can, but, well…there’s a but.
Swapping the steps in Gremlin introduces another barrier step, order(). The presence of order() immediately after repeat().until() means that we have to find and sort all paths, not just the shortest weighted paths. So we will need to add a bit more to the query to optimize it.
The swapping of these steps, however, does guarantee that the paths that we shape at step K are ordered according to their total distance. This is ultimately what we want; we are just processing more data than we want because we are still finding all paths.
Let’s see where we will be starting our query building in Example 9-10 with the swapped logic from Example 9-9. The changes we built are labeled with O1*, to indicate this is the first optimization we have built so far.
A      g.withSack(0.0).
B        V().has("Address","public_key","1094").
C        repeat(
D,E,F      outE("rated").sack(sum).by("norm_trust").inV().
G          simplePath()
H        ).until(has("Address","public_key","1337")).
O1*      order().by(sack(),asc).
O1*      limit(1).
K        project("path_information","total_elements","trust_distance").
           by(path().by("public_key").by("norm_trust")).
           by(path().count(local)).
           by(sack())
Example 9-10 finds all weighted paths between 1094 and 1337. You do not want to use this yet in a production application because finding all paths is too computationally expensive. There are multiple optimizations we can apply to ensure that we create a query that is safer to run in a distributed graph in production.
The construction of an object to track shortest weighted paths will be used many times in the coming optimizations.
The idea is to create a map. The keys of the map will be visited vertices, and the value will track the shortest distance to that vertex. The additional processes to the algorithm are shown in Example 9-11. We map the procedures from Example 9-11 to the Gremlin query in Example 9-12.
A    Use sack to initialize the path distance to 0.0
B    Find your starting vertex v1
C    Repeat
D      Move to outgoing edges
E      Increment the sack value by the edge weight
F      Move from edges to incoming vertices
G      Remove the path if it is a cycle
O2*    Create a map; the keys are vertices, value is the minimum distance
H    Check if the path reached v2
O1   Sort the paths by their total distance value
O1   Allow the first x paths to continue
K    Shape the result
The query that applies the algorithm from Example 9-11 is Example 9-12. The changes we built are labeled with O2*, to indicate this is the second optimization we have built so far.
A      g.withSack(0.0).
B        V().has("Address","public_key","1094").
C        repeat(
D,E,F      outE("rated").sack(sum).by("norm_trust").inV().
G          simplePath().
O2*        group("minDist").   // create a map
O2*          by().             // the keys are vertices
O2*          by(sack().min())  // the values are the min distance
H        ).until(has("Address","public_key","1337")).
O1       order().
O1         by(sack(),asc).
O1       limit(1).
K        project("path_information","total_elements","trust_distance").
K          by(path().by("public_key").by("norm_trust")).
K          by(path().count(local)).
K          by(sack())
Let’s talk about the map we constructed on the lines labeled O2* in Example 9-12. This map contains keys and values where the keys are vertices. The trick here is in how we set up the values: by(sack().min()). The values in this map will be the minimum distance to any visited vertex in the graph.
Essentially, this map creates a lookup table that every traverser can access and ask: what is the current minimum distance to my current vertex?
Now that we have created this map, let’s use it.
The map minDist tracks each visited vertex and the minimum distance found to that vertex.
For any traverser in our stream, we want to do two things. First, we want to use the map to look up the minimum distance we have seen so far to that vertex. Then we want to compare that value to the current traverser’s traveled distance.
If the distances are the same, that means the current traverser is on the shortest path to the current vertex. If the traverser’s distance is greater than the shortest distance, then it is exploring a longer weighted path, and we want to remove it from the traversal pipeline. There will not be a case when the traverser’s distance is less because we update the map before we do this comparison.
Let’s look at the pseudocode that describes this process. The new optimization is labeled with O3* in Example 9-13.
A    Use sack to initialize the path distance to 0.0
B    Find your starting vertex v1
C    Repeat
D      Move to outgoing edges
E      Increment the sack value by the edge weight
F      Move from edges to incoming vertices
G      Remove the path if it is a cycle
O2     Create a map; the keys are vertices, value is the minimum distance
O3*    Remove a traverser if its path is longer than the min path
H    Check if the path reached v2
O1   Sort the paths by their total distance value
O1   Allow the first x paths to continue
K    Shape the result
To implement Example 9-13 in Gremlin, we will need to introduce two new patterns of steps. First, we will create a custom filter with the filter() step.
The filter() step evaluates the traverser to either true or false, where false will not pass the traverser to the next step.
Inside the filter step, we will use a new pattern. We need to create a pattern that evaluates two values, a and b. The common way to do this in Gremlin is to create a map and then test the objects in the map with the where() step.
The where() step filters the current object; in our examples, we will filter based on the object itself.
The project().where() pattern tests the objects in the map according to a provided condition in the where() step.
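Before we see this pattern inside our pathfinding query, here is a minimal standalone sketch of project().where(); the traversal and the threshold are illustrative only and are not from the chapter's examples.

    // Keep only the Address vertices with exactly one outgoing "rated" edge
    g.V().hasLabel("Address").
      filter(project("a","b").            // build a two-key map per traverser
               by(outE("rated").count()). // a: the vertex's outgoing degree
               by(constant(1L)).          // b: a constant to compare against
             where("a",eq("b")))          // pass only when a == b

This same project().where() shape is what we use next to compare a traverser's sack against the minDist map.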
Let’s see these steps in action in Example 9-14.
A      g.withSack(0.0).
B        V().has("Address","public_key","1094").
C        repeat(
D,E,F      outE("rated").sack(sum).by("norm_trust").inV().as("visited").
G          simplePath().
O2         group("minDist").
O2           by().
O2           by(sack().min()).
O3*        filter(project("a","b").                                  // boolean test
O3*                 by(select("minDist").select(select("visited"))). // a
O3*                 by(sack()).                                      // b
O3*               where("a",eq("b")))                                // does a == b?
H        ).until(has("Address","public_key","1337")).
O1       order().
O1         by(sack(),asc).
O1       limit(1).
K        project("path_object","total_elements","trust_distance").
K          by(path().by("public_key").by("norm_trust")).
K          by(path().count(local)).
K          by(sack())
Let’s describe what is happening with our new optimization lines, labeled O3*. We create a boolean test for a traverser with the filter() step: if the condition is true, the traverser will survive. The test uses the project().where() pattern to set up and compare variables: a and b. The value for a uses our map minDist to get the minimum distance to the current vertex. Then we look up the traverser’s current sack value; this is the value for b.
If the current minimum distance to the vertex is equal to the traverser’s sack, the test resolves to True and the traverser survives. This means that the traverser is on the shortest path, so we want it to continue exploring the graph.
If you have been following along in the notebook, Example 9-14 is the first time our weighted path queries are able to return without a timeOut error. This is because this optimization is the first step toward the reduction of paths that we process in the query. The first two optimizations were setting us up to apply them at the lines labeled O3* in Example 9-14.
There are a few more ways that we can prune paths from our working tree.
A common optimization we add reduces the search space to address the computational complexity of path queries. Specifically, we want to filter out a traverser from the pipeline if it has arrived at a supernode. The definition of a supernode will vary according to your dataset, like the celebrity problem within the Twitter graph that we talked about in “Supernodes in graphs”.
Let’s take a look at this graph’s degree distribution, shown in Figure 9-8.
The outgoing degree distribution of this graph shows that most vertices have 20 or fewer outgoing edges. The far right value in Figure 9-8 shows that the outlier in our dataset has 763 outgoing edges.
For illustrative purposes, let’s say that we want to exclude vertices with 100 or more outgoing edges. The pseudocode in Example 9-15 shows where we will apply this filter with the label O4*. Example 9-16 shows the Gremlin query.
A    Use sack to initialize the path distance to 0.0
B    Find your starting vertex v1
C    Repeat
D      Move to outgoing edges
E      Increment the sack value by the edge weight
F      Move from edges to incoming vertices
G      Remove the path if it is a cycle
O2     Create a map; the keys are vertices, value is the minimum distance
O3     Remove a traverser if its path is longer than the min path to the current v
O4*    Remove a traverser if it walked into a supernode; 100 outgoing edges or more
O4*    Remove a traverser if its distance is greater than what we want to process
H    Check if the path reached v2
O1   Sort the paths by their total distance value
O1   Allow the first x paths to continue
K    Shape the result
The query that implements Example 9-15 is shown in Example 9-16.
max_outgoing_edges = 100;
max_allowed_weight = 1.0;
A      g.withSack(0.0).
B        V().has("Address","public_key","1094").
C        repeat(
D,E,F      outE("rated").sack(sum).by("norm_trust").inV().as("visited").
G          simplePath().
O2         group("minDist").
O2           by().
O2           by(sack().min()).
O3         and(project("a","b").
O3               by(select("minDist").select(select("visited"))).
O3               by(sack()).
O3             where("a",eq("b")),
O4*            filter(sideEffect(outE("rated").count(). // optimization:
O4*              is(gt(max_outgoing_edges)))),          // remove supernodes
O4*            filter(sack().                           // optimization:
O4*              is(lt(max_allowed_weight))))           // global heuristic
H        ).until(has("Address","public_key","1337")).
O1       order().
O1         by(sack(),asc).
O1       limit(1).
K        project("path_object","total_elements","trust_distance").
K          by(path().by("public_key").by("norm_trust")).
K          by(path().count(local)).
K          by(sack())
Let’s walk through the new steps labeled O4* in Example 9-16. We added two boolean tests. The first one is a filter() that checks for the current vertex’s outgoing degree and compares it to our supernode threshold. If the degree is higher than 100, the traverser fails the test and is removed from the traversal pipeline. This is how you can specifically remove supernodes from your pathfinding query.
The supernode avoidance optimization required us to use sideEffect(); we will explain why in the next section.
The second optimization in Example 9-16 adds another filter() and applies a global heuristic. We set the maximum weight we want to consider to 1.0, and we test for this by comparing the traverser’s sack() to our threshold. If the traverser’s distance is greater than 1.0, it will fail the test and be removed from the pipeline.
The query in Example 9-16 wrapped all of our optimizations within a new step, and(). We will explain and() and sideEffect() in the next two sections and then we will look at the results from Example 9-16.
The and() step in Gremlin is a filter that can have an arbitrary number of traversals. The and() step applies a boolean AND to the results from each traversal to create a pass/fail condition for the traverser.
The and(t1, t2, …) step in Gremlin yields true or false for each traverser in the pipeline according to the values for each input traversal t1, t2, and so on.
In the context of Gremlin, each traversal within the and() step must produce at least one output. In the context of boolean operations, the following values are interpreted as false:
False
Numeric zero of all types
Empty strings
Empty containers (including tuples, lists, dictionaries, sets, and frozen sets)
All other values are interpreted as True.
There are three traversals that are wrapped within the and() step in Example 9-16. Each traversal’s boolean condition is tested, and all three results are analyzed with and(). Only if all three conditions yield true will the traverser pass. This means that the traverser must be on a shortest path, not on a supernode, and have a total distance less than 1.0.
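As a minimal illustration of the step on its own (ours, not the book's), the following and() keeps only the Address vertices that have both incoming and outgoing rated edges; each inner traversal must yield at least one result for the traverser to pass.

    g.V().hasLabel("Address").
      and(outE("rated"),  // must produce at least one outgoing rating
          inE("rated"))   // and at least one incoming rating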
The last main concept to understand is why we needed to use sideEffect() for our supernode test.
One of the most valuable heuristics you can apply to a pathfinding query removes traversers when they are on a supernode. You have to count the edges of the current vertex to figure out whether or not it is a supernode. When you are on a vertex, you have to move to all outgoing edges to count them.
Moving from the current vertex to all outgoing edges changes the location of the traverser. When we are in the middle of a pathfinding query, this would change the location of our traverser from a vertex to a set of edges. This change would break the conditional flow of our repeat() step.
Therefore, we have to check whether the current vertex is a supernode and do so in a way that doesn’t change the state of the current traverser (or doesn’t move the traverser). We can do these types of side computations by using sideEffect(), one of the five general ways that a traverser can move throughout a graph.
The sideEffect(<traversal>) step allows the traverser’s state to proceed unchanged to the next step, but with some computational value from the provided traversal.
We used sideEffect(outE("rated").count().is(gt(max_outgoing_edges))) in Example 9-16. Let’s break this down.
First, the traversal that is wrapped within sideEffect() is outE("rated").count().is(gt(max_outgoing_edges)). This asks the question: is the number of outgoing edges on this vertex less than the maximum we are allowing?
To answer that question, the traverser has to move from the vertex to all outgoing edges, count them, and compare the result to max_outgoing_edges. The problem is that we have to move. We do not want this move to affect the state of the traversal, so we wrap this traversal in sideEffect() so that whatever we do in this embedded traversal doesn’t change where the traverser is located in the graph as it moves to the step after sideEffect(<traversal>).
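Here is a minimal sketch (ours, not from the text) showing that sideEffect() leaves the traverser where it was: the edge count is computed and discarded, and the traverser is still on the vertex for the next step.

    g.V().has("Address","public_key","1094").
      sideEffect(outE("rated").count()). // walks to the edges and counts them; result is discarded
      values("public_key")               // the traverser is still on vertex 1094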
This gives us everything we need to know about how the traversal works. Let’s look at and interpret the results of Example 9-16.
The last set of results to understand for our chapter’s examples is shown in Example 9-17. Let’s look at the shortest weighted path now.
{"path_information": {"labels": [<omitted in text>],
                      "objects": ["1094","0.0","64","0.0","104","0.0","23","0.0792",
                                  "1217","0.0248","1437","0.0","35","0.0248","1337"]},
 "total_elements": "15",
 "trust_distance": "0.1288"}
Our final shortest weighted path had a total trust distance of 0.1288 and has a length of 7! (15 elements means eight vertices and seven edges; path length is the number of edges in the path.)
The trust distance is 0.1288, where 10^(-0.1288) = 0.7434, so we conclude that we trust this path.
Thinking back to the broader application, we also conclude that we trust accepting bitcoins from 1337. What would you do?
Whether or not you actively trade Bitcoin, you’ve already integrated the concept of weighted paths and determining trust into your daily life. You may not go through transforming the data into a logarithmically normalized graph, but we would bet that you use the concepts from the past two chapters in some way.
The beauty of graph technology lies in translating natural human tendencies into quantifiable models. Throughout Chapter 8 and so far in this chapter, we walked through many different ways of translating natural thinking into metrics and models. We showed you how to use the idea of distance between people and concepts to teach you how we think about data to solve complex path problems in production.
The big moment here is that you naturally make decisions and inferences about previously disconnected topics. And this naturally occurring process you already do maps very well to graph technology to quantify your decision in a repeatable framework.
Graph technology gives us a framework for defining, modeling, quantifying, and applying mental processes that we take for granted, like correlating path distance to trust. This is what makes graph technology so beautiful and impactful. The things we already naturally do without thinking can be formally and logically defined with technology that represents them in the same way.
So how would you rate your trust in us given all the stops along your journey throughout this book? Once you start thinking about your journey, would you assign different strengths to different pieces of your journey?
Maybe you should consider using author and content ratings to create a graph of trustable resources, which seems oddly reminiscent of how Netflix ignited the journey of graph thinking with movie recommendations based on user ratings. And that sounds like a great topic to visit next.
Our next chapter is going to be a Netflix-like example in which we show you how to recommend movies based on user ratings.
The Netflix Prize was an open machine learning competition started in 2006. Each team that entered the competition aimed to build an algorithm capable of besting Netflix’s own content rating prediction process. The competition awarded $1 million to the winning team in 2009.
One specific derivative of the Netflix Prize sent waves throughout the graph theory community, a result you are experiencing now as you read this book. The competition ignited the use of graph thinking as a solution for traditionally matrix-based algorithms.
The realization was that it is much easier to explain recommendation systems with a graph than with a matrix representation. Think about it. You have a favorite set of movies, and each of those movies is highly rated by other people. If you look at the other movies liked by those people, you have a list of movies that you may also like. You have a list of movie recommendations.
And you just walked through a graph to find them.
The Netflix Prize1 popularized the idea of using relationships between users and movies to predict and personalize your digital experience. This small idea of thinking about your data like a graph has become one of the main drivers of the rise of graph thinking.
We will bring this idea to life throughout this chapter and Chapter 12. And in case you are wondering, Chapter 11 shows you how we created the graph model you will see in this chapter.
In this chapter, we will show and define collaborative filtering by walking through how a site/app makes movie recommendations to its users.
In the first section, we are going to walk through three different examples of recommendation systems. These three examples illustrate how deeply ingrained the use of graph thinking has become for customizing a user’s experience in an application. You likely use these techniques every day, perhaps without even knowing it.
The second section will walk through an introduction to collaborative filtering. We will focus on item-based collaborative filtering because it is the most popular way to use graph structures for recommendations.
The third section will introduce two open source datasets for our movie recommendations example. We will build a complex schema and show you the data structures and loading procedures. We will be using this data throughout the next two chapters.
Then we'll take a short side tour and use the complex data model for the movie datasets as a review of this book's main techniques. We will explore the merged datasets by revisiting the three most popular production queries with graph data: neighborhoods, trees, and paths.
The last section of this chapter steps through doing item-based collaborative filtering in Gremlin. As you have seen in the trees and paths chapters, we will run into a problem at the end of this chapter due to the scalability of doing collaborative filtering in real time.
The popularity of recommendation systems with graphs derives from the simplicity of explaining how they work. Let’s follow a deeper progression through graph structure, one neighborhood at a time, to show recommendations in three different industries.
We want to start our examples with how we see the problem now.
If you think back through some of your most recent interactions with doctors, you probably have a short list of the doctors you trust the most.
Now, what would you say if your friend asked you to recommend a doctor?
To give a personal recommendation, you consider a plethora of factors, such as the outcome of your last visit, how you were treated, how expensive it was, and so on. Therefore, you likely didn’t respond immediately to your friend’s question with your favorite doctor; instead, you probably asked your friend for more information.
You need more context from your friend to make sure your recommendation is relevant to them. You use the additional details you gather about your friend’s question to match them up to your experiences, and then you customize your recommendation.
How you ultimately respond and give a recommendation about healthcare probably looks something like the drawing in Figure 10-1. Your response to your friend shows how you think about recommendations using the first neighborhood of your personal health graph.
It is the deeper details behind your recommendation that make it relevant.
For a much less personal topic than healthcare, let’s see how deeper information from a graph structure can be used to create recommendations in your social media accounts.
Think about the last time that you logged in to LinkedIn. Did you see a notification for “people you may know”?
The “people you may know” section is an example of using graph structure to recommend new connections on social media. This section also illustrates how to use your second neighborhood to create a recommendation. Figure 10-2 shows how to build up a list of people you may know in a graph.
Let’s walk through how the concept from Figure 10-2 works on LinkedIn or on any other social media platform.
Over time, you have built up a list of friends on your social media account; Figure 10-2 illustrates this with the list of friends in orange. The friends of your friends form the list of people that you may also know, as shown further to the right in Figure 10-2.
It really is that simple. People you may know are in your second neighborhood of friends on social media.
This usually prompts the question: who is the person you are most likely to already know? And you probably already guessed the answer. The most connected friend of your friends is the top recommended person that you may also know.
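A hypothetical Gremlin sketch of this idea follows; the User label, user_id key, and friends edge label are illustrative stand-ins, not the chapter's schema. It remembers your friends, walks to their friends, drops people you already know, and ranks the rest by how many shared friends reach them.

    g.V().has("User","user_id","me").as("me"). // start at your profile
      out("friends").aggregate("friends").     // remember your first neighborhood
      out("friends").                          // walk to friends of friends
      where(neq("me")).                        // exclude yourself
      where(without("friends")).               // exclude people you already know
      groupCount().                            // count the paths to each candidate
      order(local).by(values, desc)            // most shared friends first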
The examples so far show shallow walks through first and second neighborhoods of graph data. Let’s see a walk that goes a bit deeper.
The section of recommended products has become an expectation for any online retailer. People want to search for a product and then explore the company’s catalog of similar products.
Product recommendations can be generated by walking through deeper neighborhoods of connected data. Figure 10-3 shows how a product you purchased can create a recommendation of three other products.
Let’s think about what Figure 10-3 is showing. It starts by showing one product that you purchased. Your online retailer also knows other people bought that product, as well as the other products they bought. The final list of three products on the far right in Figure 10-3 becomes the three products you see in a “similar products” window while you shop online.
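In Gremlin, a sketch of this walk might look like the following; the Product label, product_id key, and bought edge label are hypothetical stand-ins for a retailer's schema.

    g.V().has("Product","product_id","p1").as("purchase"). // the product you purchased
      in("bought").                // the people who also bought it
      out("bought").               // the other products those people bought
      where(neq("purchase")).      // excluding the input product
      dedup().                     // unique products only
      limit(3)                     // three similar products to display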
Walking through Figure 10-3 also gives you your first glimpse into how collaborative filtering works. Let’s delve into the algorithm we will be implementing in this chapter.
Using collaborative filtering with graph-structured data is a proven technique for personalizing content recommendations. The industry defines collaborative filtering as follows:
Collaborative filtering is a type of recommendation system that predicts new content (filtering) by matching the interests of the individual user with the preferences of many users (collaborating).
Let’s look at a quick introduction to the problem domain of recommendation systems and collaborative filtering.
Collaborative filtering is a very popular technique within the graph community. But it is better known for the larger role it plays within the class of recommender systems. Generally speaking, collaborative filtering is one of four types of automated algorithms that fall within the class of recommender systems. The other three are content-based, social data mining, and hybrid models.footnote:[Loren Terveen and Will Hill, “Beyond Recommender Systems: Helping People Help Each Other.” _HCI in the New Millennium_, ed. Jack Carroll (Boston: Addison-Wesley, 2001), 487–509.]
To give you an idea of how all these concepts are organized, Figure 10-4 illustrates where collaborative filtering and its subtypes fall into the broader classification of recommender systems.
Content-based recommender systems are focused only on the preferences of the user. New recommendations are made to the user from similar content according to the user’s previous choices.
The second class of recommenders, called social data mining, describes systems that do not need any input from a user. They rely solely on popular historical trends from the community to make recommendations to a new user.
Collaborative filtering is different from content-based or social data mining in that it combines individual and community preferences. The class of collaborative filtering approaches focuses on combining an individual’s interest with the historical preferences of a community of similar users. Last, hybrid models are a group of recommender systems that mix and match techniques from the other three classes.
The most popular class of collaborative-filtering techniques, item-based collaborative filtering, predates the Netflix Prize we mentioned at the beginning of this chapter. Item-based collaborative filtering is one of the most robust techniques of recommendation systems of all time; it was originally invented and used by Amazon in 1998.2 The first publication of the technique occurred in 2001.3
The ability to tailor the recommendation of certain content according to the preferences of people like you describes two classes of recommendation systems: user-based collaborative filtering and item-based collaborative filtering.
User-based collaborative filtering finds similar users who share the same rating patterns as the active user to recommend new content.
Item-based collaborative filtering finds similar items according to how users rated those items to recommend new content.
The data we will be introducing later in the chapter contains users who have rated movies. Figure 10-5 shows the different types of collaborative filtering in a model of users who rated movies.
Figure 10-5 shows how to use a graph of movie ratings to recommend new content to you. The left side of Figure 10-5 shows how to walk through the graph to perform user-based collaborative filtering and recommend new content to you. The right side of Figure 10-5 shows how to walk through the graph to perform item-based collaborative filtering and recommend new content to you.
The basic difference between user-based and item-based comes down to what each technique computes as similar. User-based collaborative filtering computes similar users, whereas item-based collaborative filtering computes similar items. Both techniques use their respective similarity scores to create a recommendation. The tasks for user-based collaborative filtering are to first compute similar users and then predict ratings of new content. The tasks for item-based collaborative filtering are to first compute similarity between items and then predict ratings of new content.
We will be using item-based collaborative filtering in all of our upcoming examples in Chapters 10 and 12. The patterns you learn from exploring item-based collaborative filtering in this chapter and the production implementation process outlined in Chapter 12 show you the path forward for expanding your use of collaborative filtering to include other techniques, such as user-based.
When using a graph, the general process for using item-based collaborative filtering is as follows:
Input: Get a user’s most recently rated, viewed, or purchased items
Method: Find similar items according to historical rating, viewing, or purchasing patterns
Recommend: Deliver different content according to a scoring model
The process above can be generalized for any system, but our upcoming examples will be about movies.
Using our movie data, which we’ll introduce about four pages from now, the input will be an individual user (you) and a movie you rated. The model will use item-based collaborative filtering to find similar movies according to the rating patterns observed in the data. The recommended content will use a scoring model to rank the recommendations. Figure 10-6 shows each of these steps.
The tricks to these approaches lie in how you rank the recommended content.
We will be illustrating three different ways to score and rank the recommendations in our examples. They will be basic path counting, a Net Promoter Score, and a normalized Net Promoter Score.
Let’s walk through each of these three approaches before we dive into the data.
One of the simplest ways to use a graph structure for an item-based recommendation system is to count. Specifically, you want to count the 5-star ratings from the users who also rated the input movie. Figure 10-7 shows what we mean.
Figure 10-7 shows how we would use path counting to rank the movies in the recommendation set; we bolded the two paths that reached Movie C to show you one of the three calculations. Let’s walk through how we reached each of the scores shown on the far right.
The results of Figure 10-7 show Movie A is the top choice with a score of three. We reach a score of three because there are three 5-star ratings of Movie A in total, one each from Users A, B, and C. The second-ranked movie is Movie C, which received two 5-star ratings, one each from Users A and C. The third-ranked movie is Movie B with a score of one, which represents the single 5-star rating from User A.
The final ordering of the recommendation set is: Movie A, Movie C, Movie B.
Counting the paths of 5-star ratings is a great place to start with item-based collaborative filtering in graphs. Hopefully, this first example is helping you to see how item-based collaborative filtering works within a graph structure.
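As a preview of where this chapter is headed, a path-counting sketch over a movie graph could look like this; we assume the rated edges carry a numeric rating property and movies have a movie_title property, which may differ from the schema introduced later in the chapter.

    g.V().has("Movie","movie_title","Movie A").as("input"). // the movie you rated
      inE("rated").has("rating",5.0).outV().                // users who gave it 5 stars
      outE("rated").has("rating",5.0).inV().                // other movies they gave 5 stars
      where(neq("input")).                                  // drop the input movie
      groupCount().                                         // count the 5-star paths per movie
      order(local).by(values, desc)                         // highest path count first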
Let’s move on to a slightly more advanced scoring model.
The Net Promoter Score (NPS) is a very popular metric that uses a scale to quantify how likely somebody is to recommend an item to a friend. For this next example, we wanted to create a metric inspired by NPS to balance the 5-star ratings from our first model with the dislikes. We are going to take the same approach with this next score. We will create a score for a movie by balancing how much it is liked with how much it is disliked.
Let’s first look at how we will calculate our NPS-inspired metric from our data. Figure 10-8 shows the equation for calculating a movie’s NPS.
We will count all of the positive ratings for a movie and then subtract all of the negative ones. For our data, we consider a rating above four to indicate a liked movie and a rating of four or below to indicate a neutral or disliked movie.
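Under the same assumed rating property as before, a sketch of this calculation in Gremlin maps each rating to +1 or -1 and sums the results for a single movie:

    g.V().has("Movie","movie_title","Movie A").
      inE("rated").
      choose(values("rating").is(gt(4.0)),
             constant(1),     // a like counts for the movie
             constant(-1)).   // a neutral or disliked rating counts against it
      sum()                   // the movie's NPS-inspired score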
We didn’t show any edges in our first model that were not 5-star ratings. To calculate an NPS, we will need to include those edges through our traversal.
We don’t want to get overly complicated because we just want to give you an idea of how the NPS works, so we are going to keep it simple by showing you two types of edges in our next example. In Figure 10-9, the thick bolded edges can be thought of as ratings greater than 4 (likes), and the dashed thin edges can be thought of as ratings below 4 (dislikes). The NPS for each movie is shown on the far right.
The final ordering of the recommendation set from Figure 10-9 is different than before: Movie A is the highest-rated movie, but Movie B and Movie C are tied. This example shows you how the NPS gives us a different idea of how liked (or disliked) a movie would be within different friendship communities.
The NPS and path-counting models converge on massively popular movies that are always highly ranked. This can be a problem in that the same movies will always be recommended; your user may lose interest if they see the same content every time they log in to your application. We recommend a normalized version of the NPS-inspired metric if you want to introduce some diversity into your recommendations.
Why would we want to normalize?
A normalized score will help you select “offbeat” movies for your users and add variety to your application. Ultimately, you may want to use both scores in your application to recommend two popular movies and one offbeat movie.
Let’s take a look at how to introduce normalization as a way to address these issues.
To account for overly popular movies, we can normalize a movie’s NPS by the total number of ratings it has received. Figure 10-10 shows how we can do this using a graph property: the degree of a movie.
Figure 10-10 shows the third model we will be using in our examples. To arrive at the final score of a movie, we will take its NPS and then divide it by the total number of ratings it has received. For example, a really popular movie may get 50 likes out of 100 ratings, which would give it a score of 0.5. An offbeat movie may get 20 likes out of 25 ratings, giving it a score of 0.8. We want to give our input users a chance to see offbeat recommendations.
Figure 10-11 shows how we will be using the normalized NPS in our upcoming examples.
Figure 10-11 divides each movie’s NPS according to its total number of ratings.
Let's step through how we calculated each movie's score in Figure 10-11. Movie A had an NPS of 3 and was rated three times; the final score for Movie A is 3/3 = 1.0. Movie B had an NPS of 1 and was rated once; the final score for Movie B is 1/1 = 1.0. Movie C had an NPS of 1 and was rated three times; the final score for Movie C is 1/3 = 0.3333.
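To normalize, we divide that same NPS-inspired sum by the movie's total number of ratings, that is, its incoming rated degree. A hedged sketch using the math() step, with the same assumed property names:

    g.V().has("Movie","movie_title","Movie C").
      project("nps","total").
        by(inE("rated").
           choose(values("rating").is(gt(4.0)), constant(1), constant(-1)).
           sum()).                  // the NPS-inspired score
        by(inE("rated").count()).   // the total number of ratings received
      math("nps / total")           // normalized score; for Movie C, 1 / 3 = 0.3333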
Generally, we are showing how an offbeat movie can be just as highly recommended as a very popular movie, allowing for some diversity in our movie recommendations.
Now that you have an idea of where we are going, let’s present the data model we will be using for this example.
There are two very popular open source datasets about movies that we are going to use: MovieLens4 and Kaggle.5 We selected the MovieLens dataset so that we could use a very diverse and well-documented dataset of user ratings of movies. The Kaggle dataset augments the MovieLens data with details and actors for each movie.
We have provided all the details as to how we matched, merged, and modeled these data sources in Chapter 11. For this chapter, we want to jump to using the data in development mode so that we can build our recommendation queries.
The data integration process between the MovieLens and Kaggle sources that we outline in Chapter 11 created the development schema we will be using in our examples. Using the Graph Schema Language (GSL), the development schema is shown in Figure 10-12.
The data model in Figure 10-12 has a lot of detail. If you prefer to understand how we arrived at our data model before we use it, we recommend jumping over to Chapter 11, where we outline in depth how we merged the two data sources and created this model. The process was too long and involved to go through now. And the topic of entity resolution deserves its own separate discussion.
Figure 10-12 shows that our data has five vertex labels: Movie, User, Genre, Actor, and Tag. The partition key for each vertex label is denoted with a (PK) next to the property name. There are three edge labels that will each have exactly one edge between the vertices it connects: acted_in, belongs_to, and topic_tagged. The GSL notation of a single line indicates that an actor acts in a specific movie only once, a movie belongs to a specific genre only once, and a movie is tagged for a specific topic only once. There are three edge labels that have many edges between the vertices they connect: rated, tagged, and collaborated_with. The GSL notation of a double line and property with a clustering key (CK) indicates that a user can rate a specific movie many times, a user can tag a specific movie many times, and an actor can collaborate with another actor many times.
By now, we hope you are getting used to looking at an image of a graph schema and using the GSL to translate it into schema statements. We designed this process to mirror how ERDs provide a programmatic way to translate a conceptual model into schema code.
From Figure 10-12, we see that there are five vertex labels. The schema code for those vertex labels is shown in Example 10-1.
schema.vertexLabel("Movie").ifNotExists().partitionBy("movie_id",Bigint).property("tmdb_id",Text).property("imdb_id",Text).property("movie_title",Text).property("release_date",Text).property("production_company",Text).property("overview",Text).property("popularity",Double).property("budget",Bigint).property("revenue",Bigint).create();schema.vertexLabel("User").ifNotExists().partitionBy("user_id",Int).property("user_name",Text).// Augmented, Random Data by the authorscreate();schema.vertexLabel("Tag").ifNotExists().partitionBy("tag_id",Int).property("tag_name",Text).create();schema.vertexLabel("Genre").ifNotExists().partitionBy("genre_name",Text).create();schema.vertexLabel("Actor").ifNotExists().partitionBy("actor_name",Text).create();
schema.vertexLabel("Movie").ifNotExists().partitionBy("movie_id",Bigint).property("tmdb_id",Text).property("imdb_id",Text).property("movie_title",Text).property("release_date",Text).property("production_company",Text).property("overview",Text).property("popularity",Double).property("budget",Bigint).property("revenue",Bigint).create();schema.vertexLabel("User").ifNotExists().partitionBy("user_id",Int).property("user_name",Text).// Augmented, Random Data by the authorscreate();schema.vertexLabel("Tag").ifNotExists().partitionBy("tag_id",Int).property("tag_name",Text).create();schema.vertexLabel("Genre").ifNotExists().partitionBy("genre_name",Text).create();schema.vertexLabel("Actor").ifNotExists().partitionBy("actor_name",Text).create();
Fun fact about the data types you see in Example 10-1: the total revenue for the movie Avatar was so high that we had to change the data type for budget and revenue from Int to Bigint. Way to go, James Cameron.
During the ETL (extract-transform-load) process in Chapter 11 that we wrote to merge the MovieLens and Kaggle data sources, we used Python’s Faker library to randomly generate names for our users. This data is not in any way associated with the users of the MovieLens project; the names are completely random.
From Figure 10-12, we see that there are six edge labels. The schema code for those edge labels is shown in Example 10-2.
schema.edgeLabel("topic_tagged").ifNotExists().from("Movie").to("Tag").property("relevance",Double).create()schema.edgeLabel("belongs_to").ifNotExists().from("Movie").to("Genre").create()schema.edgeLabel("rated").ifNotExists().from("User").to("Movie").clusterBy("timestamp",Text).// Makes the ISO 8601 standard easier to useproperty("rating",Double).create()schema.edgeLabel("tagged").ifNotExists().from("User").to("Movie").clusterBy("timestamp",Text).// Makes the ISO 8601 standard easier to useproperty("tag_name",Text).create()schema.edgeLabel("acted_in").ifNotExists().from("Actor").to("Movie").property("year",Int).create()schema.edgeLabel("collaborated_with").ifNotExists().from("Actor").to("Actor").clusterBy("year",Int).create()
schema.edgeLabel("topic_tagged").ifNotExists().from("Movie").to("Tag").property("relevance",Double).create()schema.edgeLabel("belongs_to").ifNotExists().from("Movie").to("Genre").create()schema.edgeLabel("rated").ifNotExists().from("User").to("Movie").clusterBy("timestamp",Text).// Makes the ISO 8601 standard easier to useproperty("rating",Double).create()schema.edgeLabel("tagged").ifNotExists().from("User").to("Movie").clusterBy("timestamp",Text).// Makes the ISO 8601 standard easier to useproperty("tag_name",Text).create()schema.edgeLabel("acted_in").ifNotExists().from("Actor").to("Movie").property("year",Int).create()schema.edgeLabel("collaborated_with").ifNotExists().from("Actor").to("Actor").clusterBy("year",Int).create()
We did the data ETL of matching and merging the MovieLens and Kaggle datasets for you. Along the way, we also formatted the new dataset so that it was ready to be loaded into DataStax Graph. One change we have made a few times throughout this book is to format time in the ISO 8601 standard, which makes the examples easier to reason about.
Next, let’s look at some of the data and at how to load the datafiles into DataStax Graph.
We are continuing to use the bulk loading functionality that comes with DataStax Graph so that the datasets can be loaded into the underlying tables in Cassandra as quickly as possible.
Part of that process requires formatting the datafiles to match the schema in DataStax Graph. We already did that work for you. The work involved writing files that matched the property names for the vertex and edge schema that we just created in the last section.
Let’s walk through loading the vertex data and then show the same for our edges.
The first thing we want to show you is how we formatted some of the vertex data. We are going to look at three of the five files. Figure 10-13 shows the first three lines (including the header) of the movie data that we merged and created for this example. We trimmed the overview description in this book; the file and the loaded data do contain each movie’s full overview.
The header of a vertex file must match the property names found in the DataStax Graph schema. The first line of Figure 10-13 confirms this to be the case, as we see that the header values match the property key names we defined in Example 10-1. Additionally, we wanted to make it easier to query and reason about time in this data, so we transformed the timestamps from epoch to the ISO 8601 standard and stored them as a string. You can see this in the fifth column from the left in Figure 10-13. This is not recommended for production as it introduces an extra storage cost, but it makes it easier to reason about the data when playing around with it.
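A convenient side effect of storing ISO 8601 strings is that lexicographic order matches chronological order, so plain string predicates can filter on time. Here is a small illustrative sketch (the user ID is one we use later in this chapter; run it in development mode):

dev.V().has("User", "user_id", 134558).
    outE("rated").
    has("timestamp", gte("2013-01-01 00:00:00")). // plain string comparison works for ISO 8601
    count()

This counts how many ratings that user has created since the start of 2013, without any date parsing.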
Let’s look at two more datafiles that we loaded for this example; Table 10-1 shows some of the actors in the dataset.
| actor_name | gender |
|---|---|
| Turo Pajala | unknown |
| Susanna Haavisto | unknown |
| Matti Pellonpää | male |
| Eetu Hilkamo | unknown |
| Kati Outinen | female |
Last, Table 10-2 shows some of the users we loaded into the database. We augmented the users with fake names; they are not in any way related to the MovieLens users.
| user_id | user_name |
|---|---|
| 1 | Laura Pace |
| 2 | James Thornton |
| 3 | Timothy Fernandez |
| 4 | Stacy Roth |
We created one csv file per vertex label for these examples, for a total of five files. We did this extra work so that it would be very easy to load the data directly into DataStax Graph. Example 10-3 shows the five commands needed to load the data.
dsbulk load -g movies_dev -v Movie
       -url "Movie.csv" -header true
dsbulk load -g movies_dev -v User
       -url "User.csv" -header true
dsbulk load -g movies_dev -v Tag
       -url "Tag.csv" -header true
dsbulk load -g movies_dev -v Genre
       -url "Genre.csv" -header true
dsbulk load -g movies_dev -v Actor
       -url "Actor.csv" -header true
For each of the five vertex labels, Table 10-3 shows the total number of vertices from each file that are processed by the bulk loading tool.
| Vertices processed | File |
|---|---|
| 260860 | actor_vertices.csv |
| 1170 | genre_vertices.csv |
| 329470 | movie_vertices.csv |
| 1129 | tag_vertices.csv |
| 138494 | user_vertices.csv |
Now that we have loaded all of our vertices into our development graph, we can connect them together with the edge datasets.
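If you would like to verify the load yourself, one quick sanity check is to count vertices by label. A sketch like the following is for development mode only, because it scans every vertex in the graph:

dev.V().
    groupCount(). // build a map of counts
      by(label)   // keyed by each vertex label

The totals should line up with the per-file counts the bulk loader reported in Table 10-3.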
The last concept we want to show you is how we formatted some of the edge data. We are going to look at three of the six files. Table 10-4 shows the first three lines (header included) of the rating data that we created for this example.
| User_user_id | Movie_movie_id | rating | timestamp |
|---|---|---|---|
| 1 | 2 | 3.5 | 2005-04-02 18:53:47 |
| 1 | 29 | 3.5 | 2005-04-02 18:31:16 |
As you have already seen a few times in this book, the header line is the most important concept for formatting your files to match an edge label’s schema. Table 10-4 shows how the data about user ratings matches the schema in DataStax Graph. The header line has to use the property names of the edge table in DataStax Graph; that is why the first column is labeled User_user_id and the second column is named Movie_movie_id. You can obtain this information in three different ways: (1) through the Studio schema inspection tool, (2) via cqlsh, or (3) by following the naming conventions that map edge labels to their Cassandra tables.
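For options 1 and 2, a convenient starting point is the schema API we have been using throughout this chapter; assuming DataStax Graph’s schema description support, a call like the one below prints the schema statements, from which the table and column names can be read off:

schema.describe() // prints the vertex and edge label definitions for the current graph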
Next, Table 10-5 shows the first three lines (header included) of the actor edges that we created for this example.
| Actor_actor_name | year | Movie_movie_id |
|---|---|---|
| Turo Pajala | 1988 | 4470 |
| Susanna Haavisto | 1988 | 4470 |
Table 10-5 shows two edges from actors to movies in the database. We see that the actors Turo Pajala and Susanna Haavisto acted in the movie with a movie_id of 4470 in 1988.
Last, let’s look at the collaborator edges we created for this example, in Table 10-6.
| Actor_actor_name | year | Actor_actor_name |
|---|---|---|
| Turo Pajala | 1988 | Susanna Haavisto |
| Turo Pajala | 1988 | Matti Pellonpää |
Table 10-6 shows two edges between actors who appeared in the same movie. We see that Turo Pajala and Susanna Haavisto acted in a movie together in 1988 and therefore are listed as collaborators. We expected this based on what we saw in our actor data.
If you would like, you can spend time now looking at all of the edge files that accompany this text. Since we have done this a few times now, we are going to move forward to loading the edges into the database.
To load all of the edges, we can use the bulk loading command-line tool to load them into tables in Apache Cassandra. Example 10-4 shows the six commands needed to load this data.
dsbulk load -g movies_dev -e belongs_to -from Movie -to Genre
       -url belongs_to.csv -header true
dsbulk load -g movies_dev -e topic_tagged -from Movie -to Tag
       -url topic_tag_100k_sample.csv -header true
dsbulk load -g movies_dev -e rated -from User -to Movie
       -url rated_100k_sample.csv -header true
dsbulk load -g movies_dev -e tagged -from User -to Movie
       -url tagged.csv -header true
dsbulk load -g movies_dev -e acted_in -from Actor -to Movie
       -url acted_in.csv -header true
dsbulk load -g movies_dev -e collaborated_with -from Actor -to Actor
       -url collaborator.csv -header true
For each of the six edge labels, Table 10-7 shows the total number of lines in each file that are processed by the bulk loading tool.
| Lines processed | File |
|---|---|
| 836408 | acted.csv |
| 2706175 | collaborator.csv |
| 523689 | contains_genre.csv |
| 11709769 | movie_topic_tag.csv |
| 100000 | rated.csv |
| 465321 | tagged.csv |
From here, we are ready to query this data in DataStax Graph. We want to start with some basic exploration of the data. For review, we are going to do three exploration queries that repeat the first three query patterns we taught in this book: walking through neighborhoods, trees, and paths in the movie data.
There are so many more interesting queries we could do with this data. We hope you explore what is possible in development mode in your notebook by applying the techniques from Chapter 4, Chapter 6, and Chapter 8 to answer other interesting questions.
After loading a new dataset into your graph, the first queries you will want to try explore the first neighborhood around a single vertex. Let’s recall the basics of walking around the first neighborhood of your data to show a specific user’s movie ratings.
The first query we are going to explore in this data is: for user 134558, show me all movies rated by this user, with each movie’s rating. Example 10-5 shows this query in Gremlin.
dev.V().has("User","user_id",134558).// WHERE: start at the useroutE("rated").// JOIN: walk out to all rated edgesproject("movie","rating","timestamp").// CREATE a json payloadby(inV().values("movie_title")).// JOIN and SELECT the movie titleby(values("rating")).// SELECT the edge's ratingby(values("timestamp"))// SELECT the edge's timestamp
dev.V().has("User","user_id",134558).// WHERE: start at the useroutE("rated").// JOIN: walk out to all rated edgesproject("movie","rating","timestamp").// CREATE a json payloadby(inV().values("movie_title")).// JOIN and SELECT the movie titleby(values("rating")).// SELECT the edge's ratingby(values("timestamp"))// SELECT the edge's timestamp
The first three results of Example 10-5 are shown in Example 10-6.
{"movie":"Toy Story (1995)","rating":"3.5","timestamp":"2013-06-08 08:22:47"},{"movie":"GoldenEye (1995)","rating":"3.5","timestamp":"2013-06-08 08:25:13"},{"movie":"Twelve Monkeys (aka 12 Monkeys) (1995)","rating":"2.0","timestamp":"2013-06-08 08:23:45"},...
{"movie":"Toy Story (1995)","rating":"3.5","timestamp":"2013-06-08 08:22:47"},{"movie":"GoldenEye (1995)","rating":"3.5","timestamp":"2013-06-08 08:25:13"},{"movie":"Twelve Monkeys (aka 12 Monkeys) (1995)","rating":"2.0","timestamp":"2013-06-08 08:23:45"},...
The result in Example 10-6 is a list of maps. Each map has the three keys movie, rating, and timestamp, which we set up in our query in Example 10-5. We selected the values for each respective key, and you can also see that the ISO 8601 standard is used for representing timestamps.
Figure 10-14 shows another way to think of the first three results from Example 10-6.
More often than not, there is a little bit of manipulation you would like to do to your query results. We have practiced shaping query results many times throughout this book. Let’s take one more look at how we could query the first neighborhood around user 134558 to list our user’s movies by the ones they liked, disliked, or are neutral about.
In this next example, we want to query the first neighborhood of user 134558. But this time we want to group the movies rated by 134558 according to whether the user liked, disliked, or was neutral about them. The rating scale in our data ranges from 0.5 to 5.0. We will say that movies with a rating of 4.5 or higher are liked. Movies with a rating between 3.0 and 4.5, but not including 4.5, will be considered neutral. Movies with a rating between 0 and 3.0, but not including 3.0, will be considered disliked. Yes, this is a different rating system than the models we walked through before; we are using this example to teach concepts in shaping neighborhood results. Eventually, we will get back to recommendations.
Let’s look at how to do this in Gremlin in Example 10-7.
If you prefer, the step choose() can replace coalesce() in Example 10-7.
1dev.V().has("User","user_id",134558).// WHERE: start at the user2outE("rated").// JOIN: walk to the "rated" edge3group().// CREATE: make a group4by(values("rating").// SELECT KEYS: according to the ratings5coalesce(__.is(gte(4.5)).constant("liked"),// KEY 1: "liked"6__.is(gte(3.0)).constant("neutral"),// KEY 2: "neutral"7constant("disliked"))).// KEY 3: "disliked"8by(inV().values("movie_title").fold())// SELECT VALUES: the values
1dev.V().has("User","user_id",134558).// WHERE: start at the user2outE("rated").// JOIN: walk to the "rated" edge3group().// CREATE: make a group4by(values("rating").// SELECT KEYS: according to the ratings5coalesce(__.is(gte(4.5)).constant("liked"),// KEY 1: "liked"6__.is(gte(3.0)).constant("neutral"),// KEY 2: "neutral"7constant("disliked"))).// KEY 3: "disliked"8by(inV().values("movie_title").fold())// SELECT VALUES: the values
Let’s walk through each step of Example 10-7 before we look at the results in Example 10-8. Lines 1 and 2 in Example 10-7 start at user 134558 and walk out to each of the user’s ratings. Line 3 in Example 10-7 creates a group. A group in Gremlin always has two components: keys and values. The first by() step wraps lines 4 through 7 and sets up the keys. The second by() step is on line 8 and determines the values for the group. The keys will be “liked,” “neutral,” or “disliked.” We use the coalesce step in Gremlin like an if/elif/else statement to determine into which key the user’s rating will be grouped. Line 5 filters all ratings with a value of 4.5 or higher into the “liked” group. After that filter, the remaining edges flow to the next filter on line 6, which grabs all ratings of 3.0 or higher for the “neutral” group. All other edges will have a rating under 3.0 and will go into the “disliked” key.
Line 8 of Example 10-7 is the last step for shaping the query results. For each object in this map, we want the value to be the movie title. Therefore, we have to walk from the edge into the movie vertex and grab the movie title.
Example 10-8 displays the first three movies for each key.
{"neutral":["GoldenEye (1995)","Babe (1995)","Apollo 13 (1995)",...],"liked":["Braveheart (1995)","Shawshank Redemption The (1994)","Forrest Gump (1994)",...],"disliked":["Twelve Monkeys (aka 12 Monkeys) (1995)","Stargate (1994)","Ace Ventura: Pet Detective (1994)",...]}
{"neutral":["GoldenEye (1995)","Babe (1995)","Apollo 13 (1995)",...],"liked":["Braveheart (1995)","Shawshank Redemption The (1994)","Forrest Gump (1994)",...],"disliked":["Twelve Monkeys (aka 12 Monkeys) (1995)","Stargate (1994)","Ace Ventura: Pet Detective (1994)",...]}
The first two example queries in this chapter give you an idea of how to walk through the neighborhoods of data in the movie database. We hope they were a good review of the different queries we have been teaching throughout this book.
The next main example is to walk through a tree within this dataset.
As we talked about when we did the tree queries through our sensor data, your data’s branching factor can get out of hand really quickly. That remains true for the data that we loaded for this example.
We went through the difficulty of integrating the Kaggle dataset into our database so that we could have some type of tree to query in our data. We like to think of the following query as looking at an actor’s “family tree” of collaborators.
The tree we want to find in our data starts with Kevin Bacon and finds a lineage of actors he worked with. We kept it simple and limited the query in two ways. First, we wanted to consider only his collaborations from 2009 onward. And because everyone is somehow connected to Kevin Bacon, we wanted to look only three levels deep into the tree.
Let’s look at the query in Example 10-9.
1dev.V().has("Actor","actor_name","Kevin Bacon").as("Mr. Bacon").2repeat(outE("collaborated_with").has("year",gte(2009)).as("year").3inV().as("collaborated_with").4simplePath()).5times(3).6path().7by("actor_name").8by("year")
1dev.V().has("Actor","actor_name","Kevin Bacon").as("Mr. Bacon").2repeat(outE("collaborated_with").has("year",gte(2009)).as("year").3inV().as("collaborated_with").4simplePath()).5times(3).6path().7by("actor_name").8by("year")
The query in Example 10-9 starts at Kevin Bacon and walks out to all of his collaborators starting in 2009. This repeats three times, where we eliminate repeating paths through the data with simplePath() on line 4. After walking three layers deep, we shape the results on lines 6 through 8 by returning the actor’s name from the vertices and the year from the edges in the path object.
Example 10-10 shows the first two results from Example 10-9.
{"labels":[["Mr. Bacon"],["year"],["collaborated_with"],["year"],["collaborated_with"],["year"],["collaborated_with"]],"objects":["Kevin Bacon","2009","David Koechner","2009","Bob Gunton","2009","Gretchen Mol"]},"labels":[["Mr. Bacon"],["year"],["collaborated_with"],["year"],["collaborated_with"],["year"],["collaborated_with"]],"objects":["Kevin Bacon","2009","Renée Zellweger","2010","Forest Whitaker","2009","Jessica Biel"]},...
{"labels":[["Mr. Bacon"],["year"],["collaborated_with"],["year"],["collaborated_with"],["year"],["collaborated_with"]],"objects":["Kevin Bacon","2009","David Koechner","2009","Bob Gunton","2009","Gretchen Mol"]},"labels":[["Mr. Bacon"],["year"],["collaborated_with"],["year"],["collaborated_with"],["year"],["collaborated_with"]],"objects":["Kevin Bacon","2009","Renée Zellweger","2010","Forest Whitaker","2009","Jessica Biel"]},...
Figure 10-15 visualizes the tree of results from Example 10-10.
The last query we want to explore in this data recalls the pathfinding queries we built over the Bitcoin data. Let’s look at how to find paths in this dataset.
Every actor is connected to Kevin Bacon. Let’s use this colloquialism to find paths between two actors in our dataset.
Example 10-11 uses the collaborated_with edges to find the first three shortest paths between Kevin Bacon and Morgan Freeman.
1dev.V().has("Actor","actor_name","Kevin Bacon").as("Mr. Bacon").2repeat(outE("collaborated_with").as("year").3inV().as("collaborated_with")).4until(has("Actor","actor_name","Morgan Freeman").as("Mr. Freeman")).5limit(3).6path().7by("actor_name").8by("year")
1dev.V().has("Actor","actor_name","Kevin Bacon").as("Mr. Bacon").2repeat(outE("collaborated_with").as("year").3inV().as("collaborated_with")).4until(has("Actor","actor_name","Morgan Freeman").as("Mr. Freeman")).5limit(3).6path().7by("actor_name").8by("year")
Recall that the pattern of repeat().until() uses breadth-first search without a barrier. Therefore, when we have limit(3) on line 5 in Example 10-11, we are really finding the three shortest paths in this dataset that satisfy the stopping condition. As we have walked through many times throughout this book, lines 6, 7, and 8 shape the results of the path object.
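Before we look at the payload, one defensive tweak is worth knowing about: repeat() without simplePath() is free to revisit actors it has already seen, which can balloon the search space on a graph this dense. A sketch of Example 10-11 that rules out such cycles (dropping the year labels for brevity) looks like this:

dev.V().has("Actor", "actor_name", "Kevin Bacon").
    repeat(out("collaborated_with").
           simplePath()).               // never revisit an actor on the same path
    until(has("Actor", "actor_name", "Morgan Freeman")).
    limit(3).
    path().
      by("actor_name")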
Example 10-12 shows the JSON payload for Example 10-11.
{"labels":[["Mr. Bacon"],["year"],["collaborated_with"],["year"],["collaborated_with"]],"objects":["Kevin Bacon","1979","Julie Harris","1990","Morgan Freeman"]},{"labels":[["Mr. Bacon"],["year"],["collaborated_with"],["year"],["collaborated_with"]],"objects":["Kevin Bacon","1982","Mickey Rourke","1989","Morgan Freeman"]},{"labels":[["Mr. Bacon"],["year"],["collaborated_with"],["year"],["collaborated_with"]],"objects":["Kevin Bacon","1983","Ellen Barkin","1984","Morgan Freeman"]}
{"labels":[["Mr. Bacon"],["year"],["collaborated_with"],["year"],["collaborated_with"]],"objects":["Kevin Bacon","1979","Julie Harris","1990","Morgan Freeman"]},{"labels":[["Mr. Bacon"],["year"],["collaborated_with"],["year"],["collaborated_with"]],"objects":["Kevin Bacon","1982","Mickey Rourke","1989","Morgan Freeman"]},{"labels":[["Mr. Bacon"],["year"],["collaborated_with"],["year"],["collaborated_with"]],"objects":["Kevin Bacon","1983","Ellen Barkin","1984","Morgan Freeman"]}
For fun, we also wanted to take a look at the results in their graph structure. Figure 10-16 shows the three paths from Example 10-12.
We hope you found the four queries in this section to be a helpful review of the query concepts we have been teaching throughout this book. We leave it up to you to turn these into production queries; we won’t be doing that in the production recommendation chapter.
Now let’s return to the topic of this chapter: recommendation systems. The next section will build up different Gremlin queries to show you how to do collaborative filtering.
We have built up our use case, defined collaborative filtering, seen some examples, and explored our data. The last part of this chapter focuses on performing item-based collaborative filtering to recommend new movies to a user in our data. We are going to show you three different ways to do that in a development environment.
Let’s get started with the query and results for the first approach to item-based collaborative filtering with graph data.
The first way that we will be recommending movies to a user will follow the basic path-counting approach. The general process of walking through your graph data to do this is outlined in Example 10-13.
For a specific user
Walk to the last movie they rated
Walk to all users who highly rated that movie
Walk to all movies highly rated by these users
Group and count all movies in the recommendation set
Sort the movies by frequency, in descending order
The top movies form the recommendation set
The pseudocode in Example 10-13 outlines how we are going to walk through our graph data for our first collaborative-filtering example. This first approach essentially counts how often a movie shows up in the recommendation set. The movies with the highest scores would be considered the most likely recommendations based on the user’s most recent rating.
Example 10-14 shows the Gremlin query from the approach outlined in Example 10-13.
1dev.V().has("User","user_id",694).// look up a user2outE("rated").// traverse to all rated movies3order().by("timestamp",desc).// order all edges by time4limit(1).inV().// traverse to the most recent rated movie5aggregate("originalMovie").// put this movie in a collection6inE("rated").has("rating",gt(4.5)).outV().// users who rated this movie 57outE("rated").has("rating",gt(4.5)).inV().// the full recommendation set8where(without("originalMovie")).// remove the original movie9group().// create a map of the recommendations10by("movie_title").// an entry's key is the movie title,11by(count()).// the value will be the total # of ratings12unfold().// unfold all map entries into the pipeline13order().// order the results14by(values,desc)// by their count, descending
1dev.V().has("User","user_id",694).// look up a user2outE("rated").// traverse to all rated movies3order().by("timestamp",desc).// order all edges by time4limit(1).inV().// traverse to the most recent rated movie5aggregate("originalMovie").// put this movie in a collection6inE("rated").has("rating",gt(4.5)).outV().// users who rated this movie 57outE("rated").has("rating",gt(4.5)).inV().// the full recommendation set8where(without("originalMovie")).// remove the original movie9group().// create a map of the recommendations10by("movie_title").// an entry's key is the movie title,11by(count()).// the value will be the total # of ratings12unfold().// unfold all map entries into the pipeline13order().// order the results14by(values,desc)// by their count, descending
Let’s step through Example 10-14. Lines 1 and 2 look up a specific user vertex in the graph and traverse out to all of the user’s ratings. Lines 3 and 4 sort the ratings by time and traverse through only the most recent rating to the movie vertex. We store this movie in a collection on line 5 so that we can later remove the movie from the recommendation options. On line 6, we walk from the movie to all users who have rated that movie with a 5. We traverse from these users to all movies that they have also rated with a 5. At this point, we remove the original rated movie on line 8.
We start formatting our result set on line 9, where we create a map. Line 10 shows that the keys of the map will be movie_title. Line 11 shows that the values will be the total number of traversers that have reached that movie. Because this is one map, we unfold all entries in the map into the traversal pipeline on line 12. Lines 13 and 14 order the individual maps according to their values.
Figure 10-17 shows the top five recommended movies from Example 10-14.
In Figure 10-17, we see that The Shawshank Redemption has a score of 24, Forrest Gump has a score of 22, and Apollo 13 has a score of 21.
This first model is based only on tracing 5-star ratings through our sample data. Let’s see how to make the ranking algorithm a bit more sophisticated.
The second way we want to recommend movies uses a version of the Net Promoter Score (NPS). We will consider a rating of 4 or higher to represent a liked movie and a rating less than 4 to represent a disliked movie. We will add this into the same process that we outlined before when we process the user’s ratings. Let’s look at the pseudocode in Example 10-15 to understand how we will be walking through the graph’s data.
For a specific user
Walk to the last movie they rated
Walk to all users who highly rated that movie
Walk to all outgoing rating edges
For each edge
    If the rating is 4 or higher,
        Store 1 in the traverser's sack
    If the rating is less than 4,
        Store -1 in the traverser's sack
Walk into all movies
Group all movies in the recommendation set
For each movie in the group,
    Calculate the movie's NPS by adding all the traversers' sacks
Sort the movies by NPS, in descending order
The top movies form the recommendation set
The approach outlined in Example 10-15 is very similar to our first model. The only addition occurs when we traverse out from the user set to all of the user’s ratings because we are including all ratings. If the rating is 4 or greater, we will add one to the overall NPS. If the rating is less than 4, we will subtract one from the overall NPS.
We will use the sack() step in Gremlin to do this as efficiently as possible. We will allow each traverser to walk through the data and keep track of the edge’s rating along the way in its sack. Then we will group all of the traversers together as we did before. But instead of counting the total number of traversers that reached a movie, we will add together the values stored in their sack to create an NPS. After that, we will follow the same ordering process that we saw in the last query.
Example 10-16 shows the Gremlin query from the approach outlined in Example 10-15.
1  dev.withSack(0.0).                        // use sack to calculate NPS
2      V().has("User", "user_id", 694).
3      outE("rated").
4      order().by("timestamp", desc).
5      limit(1).inV().
6      aggregate("originalMovie").
7      inE("rated").has("rating", gt(4.5)).outV().
8      outE("rated").
9      choose(values("rating").is(gte(4.0)), // testing the rating value
10            sack(sum).by(constant(1.0)),   // add 1 if user liked the movie
11            sack(minus).by(constant(1.0))). // subtract 1 if disliked
12     inV().
13     where(without("originalMovie")).
14     group().
15       by("movie_title").
16       by(sack().sum()).                   // NPS: sum all sack values
17     unfold().
18     order().
19       by(values, desc)
Let’s step through Example 10-16. The first new piece is the use of withSack(0.0) on line 1, like we saw in the last chapter on calculating weighted paths. Lines 2 through 8 follow the same setup as the first query we walked through in this section.
Line 9 of Example 10-16 shows how we start to set up for calculating the NPS based on a user’s rating. We are showing you how to use choose(condition, true, false) semantics with Gremlin. The condition is on line 9 and checks if the edge’s rating is greater than or equal to 4. If this is true, line 10 shows how we add 1.0 into the traverser’s sack. If the condition on line 9 is false, we subtract 1.0 from the traverser’s sack. On line 12, the traversers move to all of the movies for the ratings, and line 13 removes the original movie.
Lines 14 through 19 follow the same grouping and sorting process as before, but with one change. Line 16 shows that the value for a movie in the map is the sum of all of the traverser’s sacks that arrived at that movie. We will be adding together a bunch of 1s and/or –1s. The top five recommended movies from Example 10-16 are shown in Figure 10-18.
In Figure 10-18, we see a different set of recommendations from what we saw in Figure 10-17: The Fugitive has a score of 30, Star Wars: Episode IV—A New Hope has a score of 28, and Forrest Gump also has a score of 28.
This second model can still produce a repetitive set of results in which popular movies continue to show up as the main recommendations. Let’s take this one step further and see how we can normalize the result set so we can try to find a diverse set of recommendations.
The final way we will use item-based collaborative filtering on our data illustrates one way to use normalization in the scoring model. We will still use a movie’s NPS as we did in the last section but will ultimately divide the NPS by the total number of ratings we have observed for that movie. Example 10-17 walks through the pseudocode for how we will do this as we walk through our graph data.
For a specific user
Walk to the last movie they rated
Walk to all users who highly rated that movie
Walk to all outgoing rating edges
For each edge
    If the rating is 4 or higher, store 1 in the traverser's sack
    If the rating is less than 4, store -1 in the traverser's sack
Walk into all movies
Group all movies in the recommendation set
For each movie in the group,
    Calculate its NPS
    Count all of its incoming ratings
    Divide NPS by incoming ratings
Example 10-17 follows the same process as when we had to calculate our NPS. However, when we create our map of recommendations, we will divide the NPS by the movie’s total number of incoming ratings. Let’s see how we will do this in Gremlin in Example 10-18.
1  dev.withSack(0.0).
2      V().has("User", "user_id", 694).
3      outE("rated").
4      order().by("timestamp", desc).
5      limit(1).inV().
6      aggregate("originalMovie").
7      inE("rated").has("rating", gt(4.5)).outV().
8      outE("rated").
9      choose(values("rating").is(gte(4.0)),
10            sack(sum).by(constant(1.0)),
11            sack(minus).by(constant(1.0))).
12     inV().
13     where(without("originalMovie")).
14     group().
15       by("movie_title").
16       by(project("numerator", "denominator"). // NPS/degree(movie)
17            by(sack().sum()).                  // this is NPS
18            by(inE("rated").count()).          // this is the degree of the movie
19          math("numerator/denominator"))       // this is how we divide them
Lines 1 through 15 in Example 10-18 are the same as the first 15 lines of Example 10-16, in which we calculated the NPS. The new code spans lines 16 through 19 and shows how to divide the NPS by the in-degree of the movie.
On line 16 of Example 10-18, we are filling in the values for the movies that will populate our map of results. The value that will go into the map will be the NPS for the movie divided by its total number of ratings. We create a map with only two elements by using the project() step. Then the math step on line 19 will divide these two values and put the result into the group. Line 17 forms the first element of the map and is the movie’s NPS. Line 18 forms the second element and is the total number of incoming ratings. The first five results are shown in Figure 10-19.
The results of Figure 10-19 show the first five examples of the normalized NPS. All five examples have a positive score and would be considered “liked” according to this model. There are movies with negative scores that you can explore in the Studio Notebook for this chapter.
Some of you may be wondering why we aren’t showing the sorted version of Figure 10-19 and the top five recommendations. We are showing the first five instead of the top five because this final query is pushing the limits of what we can reasonably compute within a traversal, even on this small sample set. It is that additional inE("rated").count(), which is another full partition scan per vertex, that is making this query extremely expensive.
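If you want to experiment with taming that cost before moving on, one option (a sketch, not the production approach; that topic belongs to Chapter 12) is to pay for the degree computation once, up front, and store it on each Movie vertex. This assumes you have added a degree property to the Movie vertex label in your schema, and that your Gremlin version accepts a traversal as a property value, as recent TinkerPop releases do:

// One-time, development-mode pass: a full scan, but paid only once
dev.V().hasLabel("Movie").
    property("degree", __.inE("rated").count()). // hypothetical precomputed in-degree
    iterate()

The ranking query could then read values("degree") instead of re-counting the rated edges for every traverser.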
In order to be able to deliver item-based collaborative filtering in a production environment with real user expectations, we have gone significantly past what would be reasonable to do in real time.
So at this point, you get to choose where to go next. You have two options.
The first option is to go back and understand the data we merged for this example. Chapter 11 takes a brief side tour into how we matched the MovieLens and Kaggle data together for the model and queries you saw in this chapter. We would bet that any graph user has to walk through some form of data cleaning and merging, no matter how simple it may be. If you are interested in simple entity resolution, continue on to the next chapter.
If you want to skip the nuances of basic entity resolution, we won’t blame you. In that case, jump ahead to Chapter 12 to continue with the production version of item-based collaborative filtering. In Chapter 12, we will explain why the traversals we have worked through here in development mode cannot be run in a production environment. We will walk through the last production tip for this book and show you how to deliver recommendations from item-based collaborative filtering with graph data.
1 James Bennett and Stan Lanning, “The Netflix Prize,” Proceedings of KDD Cup and Workshop, 2007.
2 Gregory D. Linden, Jennifer A. Jacobi, and Eric A. Benson, Collaborative recommendations using item-to-item similarity mappings, U.S. Patent No. 6,266,649, filed July 24, 2001.
3 Badrul Munir Sarwar, George Karypis, Joseph Konstan, and John Riedl, “Item-Based Collaborative Filtering Recommendation Algorithms,” WWW ’01: Proceedings of the 10th International Conference on World Wide Web, Hong Kong Convention and Exhibition Center, May 1–5, 2001 (New York: ACM, 2001), 285–95. https://doi.org/10.1145/371920.372071.
4 F. Maxwell Harper and Joseph A. Konstan, “The MovieLens Datasets: History and Context,” ACM Transactions on Interactive Intelligent Systems (TiiS) 5, no. 4 (2016): 19, https://doi.org/10.1145/2827872.
5 Stephane Rappeneau, “350 000+ Movies from themoviedb.org,” Kaggle, July 19, 2016, https://www.kaggle.com/stephanerappeneau/350-000-movies-from-themoviedborg.
Thinking back to our first example in this book, how do you know who your customer is in your C360 model?
Do you have a strong identifier in your dataset, like a social security number or member ID? How much do you trust those identifiers, and their source, to represent unique people with 100% accuracy?
Different industries have different tolerance levels for inaccuracy.
In healthcare, false positives can lead to misdiagnoses and potentially deadly distributions of medicine. On the other hand, if you are working with data about movies, incorrect movie resolution will lead to a less-than-seamless user experience for your application, but at least we are not talking about someone’s life being on the line.
The problem of inferring who is whom and what is what from keys and values in your data source has been a challenge since we began writing down information about people. This problem is called entity resolution and has a long, storied history of technical solutions.
For any team working on entity resolution, it is important to get things right within whatever margin of error is acceptable in your business domain.
In this chapter, we will unveil how we merged two movie datasets, the challenges we faced along the way, and the decisions we made.
First, we will define entity resolution and how it relates to two problems we have been teaching in this book: C360 and movie recommendations.
The second section walks through the two datasets in detail. We will create a detailed understanding of the data to iteratively build up a conceptual graph model. The final graph model we build out in this section is the same conceptual graph model we introduced for development in Chapter 10.
The third section steps through our merging process. We want you to have the right expectation going into our methodology section: the type of matching and merging required with the two data sources does not need graph structure for entity resolution. We hope the details in this section help you see why.
We’ll then dig into the errors we discovered during the merging process and introduce the difference between false positives and true negatives in the data.
Finally, we’ll zoom back out from the specific details of merging our movie data. We will take a quick look at some common problems that misapply the use of graph structure for resolving entities in data. Then we will show a few examples in which graph structure augments an entity resolution pipeline.
Ultimately, our goal in this chapter is twofold.
First, we want to show you what it is really like to merge data. Warning: the process isn’t glamorous. Merging datasets is tedious work that is often overlooked even though it is a common first step to creating a graph model.
The second goal of this chapter is to educate you on the overall problem domain. Because merging data is one of the most common first steps to creating a graph database, we want the information to help you understand all of the tools you need for solving this complex problem. Hint: most if not all of the entity resolution techniques you will most likely be using do not require graph structure to figure out who is whom.
The overarching theme of the matching and merging process between two data sources is a vast problem domain called entity resolution. Informally, the complex problem of entity resolution aims to resolve the question of who is whom or what is what across different data sources.
Is “Jon Smith” the same person as “John Smith”? Or in the case of our movie data, is a movie from MovieLens called Das Versprechen the same as one from Kaggle named The Promise?
However, in most traditional cases, a unique user identifier that links identities is likely not available for many reasons: use of external source data, unavailability of data due to user privacy constraints, or inconsistent data.
In most cases, logical identity has to be calculated from the keys and values of the present properties about each piece of data.
Historically, entity resolution (also referred to as entity matching or record linkage) relied on a set of probabilistic rules, usually defined by a domain expert, to take into account data distributions and biases of data for a particular domain. The set of probabilistic rules combine to form a functional model to calculate whether entity a is equal to entity b.
Typically, entity resolution starts with finding strong identifiers that can be linked across systems of record. After that, you begin to look for different properties about the data to determine whether or not the systems really are referring to the same logical identity.
The process for resolving identities within different data sources is outlined in Example 11-1; we will refer back to this outline a few times throughout the rest of the chapter.
A. Identify your data sources
B. Analyze the keys and values available from each source
C. Map out which keys strongly identify a single logical concept
D. Map out which keys weakly identify a single logical concept
E. Iterate until your matched and merged data is "good enough":
   1. Form a matching process
   2. Identify incorrect matches
   3. Resolve errors in the matching process
   4. Repeat Step #1
You analyze your sources and the keys you have within them and iteratively build up rules on how to match them together.
Sounds simple enough.
But the entire process hinges on the idea of “good enough” that we state at step E. This is where the process begins to feel more like an art than a science.
From a mathematical perspective, Figure 11-1 defines how you quantify “good enough.”
Figure 11-1 reads as: for all a and b in your dataset D, define a function f. The function f compares two pieces of data, f(a, b), and gives you a score. If that score is above a certain threshold t, then we say that a is the same as b.
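For readers who prefer the rule in symbols, here is one plausible way to write what Figure 11-1 describes, reconstructed from the definitions above rather than copied from the figure itself:

```latex
\forall\, a, b \in D:\quad f(a, b) > t \;\Longrightarrow\; a = b
```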
Example 11-1, Figure 11-1, and “Does Jon Smith = John Smith?” are all saying the same thing.
To illustrate this complex problem, Figure 11-2 shows the concept of matching and merging data across disparate sources of data.
Figure 11-2 displays a visualization of the entity resolution problem as a graph model. The graph on the left in Figure 11-2 illustrates the current state for most data architectures; mobile, web, and onsite databases contain disconnected views of the same customer. The most popular use of graph technology, a Customer 360 model, starts from the unified, connected graph, like what is shown on the right in Figure 11-2.
And the first example in this book started with a graph model exactly like the right side of Figure 11-2.
The ease with which you can describe all components of this problem with a graph model illustrates exactly why many teams misapply graph technology for the entire entity resolution process, most of which doesn’t rely on a graph for its technical solution.
We want to show you how we analyzed two popular open source movie datasets—MovieLens and Kaggle.1 Our process parallels steps A through D in Example 11-1.
We selected the MovieLens dataset so that we could use a very diverse and well-documented dataset of user ratings of movies. The Kaggle dataset augments the MovieLens data with details and actors for each movie.
Deciding to bring together two datasets ended up being one of the best decisions we made for this book because it required us to really dig into the process of what it is like to get started with graph technology. To illustrate that, this section walks you through exactly how we reasoned about our conceptual graph data model as we merged these two datasets.
Starting with the MovieLens source, we will take a look at the datafiles available and how they will fit together. Then we will walk through the Kaggle data. The most important part of this process is identifying which keys and values from the Kaggle dataset refer to the same logical concepts from the MovieLens dataset. Specifically, we will be looking for which strong identifiers we can use to match the datasets together.
The upcoming section is long and detailed to give you a real glimpse into the process.
There are six files that we used for our schema and example from the MovieLens dataset:
links.csv
movies.csv
ratings.csv
tags.csv
genome-tags.csv
genome-scores.csv
We are going to step through each of the six files while we iteratively construct our developmental graph model from Chapter 10. There are only five upcoming subsections, however, because we are going to talk about genome-tags.csv and genome-scores.csv in the same section.
We started with the links.csv file from MovieLens because it is the source of strong identifiers to external data sources. The links.csv file contains 27,278 lines of linking identifiers that can be used to link external sources of movie data. Each line of this file after the header row represents one movie and has the following format:
movieId,imdbId,tmdbId
Each strong identifier is defined as follows:
movieId is an identifier for movies used by the MovieLens project.
imdbId is an identifier for movies used by IMDB.
tmdbId is an identifier for movies used by TMDB.
For example, the movie Toy Story has a movieId of 1 (https://movielens.org/movies/1), an imdbId of tt0114709 (http://www.imdb.com/title/tt0114709), and a tmdbId of 862 (https://www.themoviedb.org/movie/862).
We started the data modeling process with this file and built the schema shown in Figure 11-3.
The schema in Figure 11-3 has one vertex label: Movie. This vertex has a partition key of movie_id and two additional properties: tmdb_id and imdb_id. We changed the casing from camelCase to snake_case to conform to naming standards when working with Apache Cassandra.
Following the process we outlined in Example 11-1, we learned the following information about this file:
There are 27,278 movies in total.
27,278 movies have an imdbId (100% coverage).
26,992 movies have a tmdbId (98.95% coverage). Note: these movies also have an imdbId.
252 movies are missing a tmdbId.
Those of you who are checking the math here may have observed that 27,278 does not equal 26,992 + 252. The comparison is off by 34 because there are 17 errors in this dataset in mapping a movie’s tmdbId to its imdbId. We will delve into this issue in a later section.
This information tells us that the MovieLens data has 100% coverage of strong identifiers from the IMDB data source. Therefore, the first identifier to check when matching into this data will be the imdb_id.
Let’s look at the dataset that starts to populate the data model with more information about each movie.
The MovieLens dataset has a movies.csv file that contains a title and genres for each movie. The MovieLens resources indicate that we connect this data to the links.csv information via the movieId.
There is an entry in the movies file for each of the 27,278 movies in the dataset. Each line has the structure:
movieId,title,genre_1|genre_2|...|genre_n
According to the MovieLens documentation, the movie titles were entered manually or imported from the MovieLens project. The genres are a pipe-delimited list and are selected from topics such as Action, Adventure, Comedy, Crime, Drama, and Western.
We discovered that there are 18 unique genres in this set.
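A quick way to verify that count is to collect the distinct genre values yourself. A sketch, assuming movies.csv uses the header above and MovieLens’s “(no genres listed)” placeholder (present in some versions of the files):

```python
import csv

genres = set()
with open("movies.csv", newline="") as f:
    for row in csv.DictReader(f):          # columns: movieId,title,genres
        genres.update(row["genres"].split("|"))
genres.discard("(no genres listed)")       # placeholder, not a real genre
print(len(genres))                         # expect 18
print(sorted(genres))
```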
We continued the data modeling process with this file and added to the schema. The next iteration of our schema is shown in Figure 11-4.
The movies.csv file gave us three additions to our data model; the mapping from the file to those additions is shown in Figure 11-4.
First, we augmented the Movie vertex to have a movie_title property. Second, we created a Genre vertex and partitioned that vertex by the genre_name. Third, we created an edge from the Movie vertex to the Genre vertex with the label belongs_to.
We needed the MovieLens dataset for its user ratings. Let’s take a look at that file next.
There are 20,000,263 ratings from users to a movie in the ratings.csv file. Each line of this file represents one rating by one user. The format of the file is:
userId,movieId,rating,timestamp
This file gives us our first glimpse of users from the MovieLens database. There are 138,493 unique userIds across the 20-million-plus ratings. Ratings are made on a 5-star scale, with half-star increments [0.5 stars, 5.0 stars]. Timestamps are in epoch: seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
The ratings.csv file introduces a new vertex and edge label into our data model. Figure 11-5 augments the schema with this new information.
As seen in Figure 11-5, our data model now has a User vertex label and a rated edge label. We partitioned the User vertices by the user_id. We added the rating and timestamp properties onto the rated edge.
In addition to ratings, the users also provided their own tags about the data. Each tag is a single word or short phrase and was created by the user. There are 465,564 tags created by users about movies.
Each line in the tags.csv file has the structure:
userId,movieId,tag,timestamp
We use the information from the tags file to continue to build our data model. The next iteration is shown in Figure 11-6.
As illustrated in Figure 11-6, we can link a tag from a user to a movie using the userId and movieId. We modeled the tag_name and timestamp on the tagged edge.
There is one last concept from the MovieLens data to add to our data model: the tag genome.
There are two files that you can find within the collection of MovieLens datasets: genome-tags.csv and genome-scores.csv. These two files analyze the tags we modeled in Figure 11-6 and represent how strongly a movie can be described by properties from the user tags.
The tag genome was computed using a machine learning algorithm on user-contributed content, including tags, ratings, and textual reviews.2
The file genome-scores.csv contains 11,709,768 movie-tag relevance scores in the following format:
movieId,tagId,relevance
The second file, genome-tags.csv, provides the tag descriptions for the 1,128 tags in the genome file, in the following format:
tagId,tag
The tags give us a new vertex label and edge label for this dataset, and are the last iteration in modeling with the MovieLens data. The tagId will map to the partition key for the Tag vertex, tag_id, and the tag will map to tag_name. Let’s take a look at Figure 11-7.
The conceptual data model in Figure 11-7 represents the full mapping of the MovieLens data into a graph model. The strong identifiers within this model are the most important pieces to understand and follow. Among all of the strong identifiers, the most important to follow is movie_id, because it is used in every file to connect each concept to a movie.
You may choose to map the data differently, and that is OK. It all comes down to how you end up querying the information and the questions you want to ask of these sets in a production environment.
The model we have in Figure 11-7 is a good starting place for development.
Let’s build on this model with the data available from Kaggle.
There are two main sources of information from a dataset on Kaggle that we are going to use to augment our data model: movie data and actor data. Let’s follow the same process as we did with the MovieLens data to continue to build our data model.
The Kaggle dataset is an excellent source for two reasons. First, it contains the most complete listing of movie information, with data available for 329,044 unique movies.
The plethora of details available for each movie is the second reason the Kaggle data is an excellent source. The file that contains all of the details about a movie is AllMoviesDetailsCleaned.csv. There are 22 different headers in this file that describe additional publicly available information about a movie, such as its budget, original language, overview, popularity, production companies, runtime, tagline, release date, and many other facts.
The most important keys in this data are id and imdb_id. Here is what we learned about the strong identifiers from the Kaggle data:
The id from the Kaggle dataset maps to the tmdb_id from TMDB.
The imdb_id maps to the movie IDs from IMDB.
All 329,044 movies from the Kaggle dataset have identifiers from TMDB.
78,480 movies from the Kaggle dataset are missing an ID from IMDB.
The only other information we have from Kaggle to compare with the MovieLens data is a movie’s title.
The coverage of strong identifiers in the Kaggle dataset helps us begin to understand how we are going to match and merge this data with MovieLens. The Kaggle data source has 100% coverage on strong identifiers from TMDB, whereas the MovieLens data source has almost 100% coverage on strong identifiers from IMDB.
A mismatch in strong identifier coverage between the data sources is both bad and good. It is bad because the matching process is not going to be straightforward. The silver lining, however, is that this example is going to make for a great educational tool on matching data.
From the AllMoviesDetailsCleaned.csv file, we pulled seven pieces of information to augment our data model. Figure 11-8 illustrates the next stage in the development of our data model.
Figure 11-8 shows six new properties we added to the Movie vertex: release date, production company, overview, popularity, budget, and revenue. The seventh detail we pulled from the Kaggle data was the genre property, which added more Genre vertices and more edges from movies to genres.
We needed this dataset so that we could merge in the information about actors for each movie. Let’s look at how we can access that information.
The file AllMoviesCastingRaw.csv provided information about actors, directors, producers, and editors for each movie. We selected only the actors to include in our examples.
The AllMoviesCastingRaw.csv file lists five actors for each of the 329,044 movies. This information is listed on one line, with the following structure for the first 11 columns:
id,actor1_name,actor1_gender, ..., actor5_name,actor5_gender, ...
Each actor was connected to their movie by matching the id to the tmdb_id of the Movie vertex.
Additionally, we created collaborator edges for actors who were in the same movie. We used the release_date from the AllMoviesDetailsCleaned.csv file to add a year to each of these new edges about actors.
Figure 11-9 shows the data model we arrived at for our example.
Figure 11-9 shows the merged data model with the MovieLens and Kaggle datasets. We didn’t include everything available from the Kaggle dataset. If there is something you would like to use, please visit us at https://oreil.ly/graph-book. We will accept pull requests for the data and processes that accompany this text.
The data integration process between the MovieLens and Kaggle sources created the development schema we will be using in our examples. Using the Graph Schema Language (GSL), the development schema is shown in Figure 11-10.
Chapter 10 showed you how to use the GSL to translate Figure 11-10 into schema statements. We designed this process for you to follow the same idea popularized by ERDs.
The idea to merge the MovieLens and Kaggle datasets for this example became a much more difficult and involved task than we anticipated.
And in our experience, problems in matching and merging data sources are always more involved than you anticipate.
The process of resolving two data sources starts with mapping the strong identifiers present in both systems for linking the data. We just went through that in the last section. We learned that the strong identifiers available across both datasets are the movie identifiers from TMDB and IMDB.
However, each dataset has a different distribution of these IDs. After studying the two sources, we learned the following:
Each entry in the MovieLens dataset has an IMDB identifier.
1% of the movies in the MovieLens dataset are missing a TMDB identifier.
Each entry in the Kaggle dataset has a TMDB identifier.
24% of the movies in the Kaggle dataset are missing IMDB identifiers.
From this information, we know that we are going to have to build a process that uses both IDs from the datasets because we can’t always rely on either.
When you need to merge data sources, always start with finding and understanding the distribution of strong identifiers in each system!
Let’s discuss how we procedurally matched and merged the data between the sources when everything matched up correctly. After this next section, we will walk through the errors we discovered in both datasets and how we resolved them.
At the start, our merging process was simple. We started by processing the MovieLens data. Then we had to figure out what it meant to be a match from the Kaggle dataset and how to merge the information.
We defined a match from the Kaggle data source into the MovieLens data source as the case in which exactly one entry in the MovieLens dataset matched on one or both of the TMDB and IMDB identifiers.
The steps we followed for a successful match and merge are shown in Example 11-2.
1   For each movie_k in the Kaggle dataset:
2       movie_m = MATCH MovieLens data by the tmdb_id of movie_k
3       if there is a movie_m:
4           if imdb_id of movie_k == imdb_id of movie_m:
5               movie_m2 = MATCH MovieLens data by the imdb_id of movie_k
6               if tmdb_id of movie_m == tmdb_id of movie_m2:
7                   UPSERT the kaggle data
8       else:
9           movie_m = MATCH MovieLens data by the imdb_id of movie_k
10          if movie_m is not null:
11              if imdb_id identifiers match:
12                  UPSERT the data
13          else:
14              We know movie_k is not in the MovieLens data
15              INSERT movie_k from Kaggle
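To make the pseudocode concrete, here is a minimal, runnable Python sketch of Example 11-2. It assumes each source has already been loaded into dictionaries keyed by identifier, and the upsert/insert callables stand in for writes to your graph; none of these names come from the book’s code repository:

```python
def match_and_merge(kaggle_movies, ml_by_tmdb, ml_by_imdb, upsert, insert):
    """Mirror of Example 11-2: match on tmdb_id, then confirm via imdb_id."""
    for movie_k in kaggle_movies:
        movie_m = ml_by_tmdb.get(movie_k["tmdb_id"])
        if movie_m is not None:
            if movie_k["imdb_id"] == movie_m["imdb_id"]:
                # cross-check: the imdb_id must lead back to the same tmdb_id
                movie_m2 = ml_by_imdb.get(movie_k["imdb_id"])
                if movie_m2 and movie_k["tmdb_id"] == movie_m2["tmdb_id"]:
                    upsert(movie_k)
        else:
            movie_m = ml_by_imdb.get(movie_k["imdb_id"])
            if movie_m is not None:
                if movie_k["imdb_id"] == movie_m["imdb_id"]:
                    upsert(movie_k)
            else:
                insert(movie_k)   # movie_k appears only in the Kaggle data
```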
The process in Example 11-2 matched 26,853 movies that were in both databases. Before the matching process, there were 252 movies in the MovieLens database with no TMDB identifier; 15 of those movies were found in the Kaggle dataset according to their IMDB identifier, which resolved their missing TMDB identifiers.
You may be wondering why the logic on lines 5 and 6 of Example 11-2 is necessary. It turns out there were some errors in the source data. We will get into those errors in “Resolving False Positives”.
For a deeper example, Figure 11-11 illustrates how the movie Toy Story would be successfully matched between the two datasets using the process outlined in Example 11-2.
The most important feature of Figure 11-11 is the values of the strong identifiers between the two sources. We had already modeled a movie titled “Toy Story (1995)” from the MovieLens data with a tmdb_id of 862 and an imdb_id of 0114709. When we processed the Kaggle movie, the algorithm worked as shown in Example 11-3.
1   For the "Toy Story" movie in the Kaggle dataset:
2       movie_m = search MovieLens data by the tmdb_id 862
3       if there is a movie_m:
4           if 0114709 == 0114709:
5               movie_m2 = search MovieLens data by the imdb_id 0114709 of movie_k
6               if 862 == 862:
7                   UPSERT the kaggle data
We used UPSERT when we inserted the data because the underlying datastore is Apache Cassandra. In this and most situations, UPSERTs are the fastest way to handle writes.
Figure 11-12 shows the merged version of the Toy Story movie that ended up in our dataset.
Along the way, we documented the tripping points and decisions we had to make. We are going to walk through those in the next section.
When you first read through the matching process in Example 11-2, you may have thought that some of the additional checks were redundant. For example, when determining whether the Kaggle data matches a MovieLens movie, we first found a movie by its TMDB identifier and then looked for it again by its IMDB identifier. Only when all of these scenarios found the same movie and identifiers did we consider it a match.
However, the process we started with discovered something really interesting about the MovieLens data: it contained false positives within its own data.
The matching process outlined in Example 11-2 first revealed errors within the links from the MovieLens database. Specifically, there were 17 occurrences in the MovieLens data of the same TMDB identifier pointing to different IMDB identifiers. This is referred to as a false positive.
A false positive error occurs when the entity resolution process links two references that are not the same.
We discovered the false positives within the MovieLens data when we were trying to merge a Kaggle record based on its TMDB identifier. When a Kaggle entry matched a MovieLens movie by their respective tmdb_ids, the sequential lookup by the Kaggle entry’s IMDB identifier returned two results from the MovieLens data.
Let’s look at some of the false positives that exist within the MovieLens data (see Table 11-1):
| movie_id | imdb_id | tmdb_id | movie_title |
|---|---|---|---|
| 1533 | 0117398 | 105045 | The Promise (1996) |
| 690 | 0111613 | 105045 | Das Versprechen (1994) |
| 7587 | 0062229 | 5511 | Samouraï, Le (Godson, The) (1967) |
| 27136 | 0165303 | 5511 | The Godson (1998) |
| 8795 | 0275083 | 23305 | The Warrior (2001) |
| 27528 | 0295682 | 23305 | The Warrior (2001) |
To know whether these are the same or different movies requires crawling the original sources. We didn’t do that work for these examples. Therefore, for these 17 instances of clashing mappings within the MovieLens data, we removed both records of each pair, a total of 34 entries, from the MovieLens source.
From deeper research on IMDB and TMDB, we found that the Kaggle dataset had the correct entries. Therefore, we used the Kaggle data as the ground truth in these instances.
After resolving the issue within the MovieLens source, we collected information about the errors found when mapping the two data sources together.
Some statistics about the errors and incorrect matches we found between the datasets are:
Zero movies had matching TMDB identifiers but mismatched IMDB identifiers.
Merging the datasets produced 143 errors in which the sources had matching IMDB identifiers but mismatched TMDB identifiers.
At the start, we did not know whether these 143 errors were false positives or false negatives. We needed to examine them to figure out what type of error they represented.
The additional data about the 143 mismatched movies that is available for comparison is as follows:
The movie’s title in each database
The public page about the movie on IMDB
The public page about the movie on TMDB
When resolving errors, you want to start with the data you have. In this case, we can compare movie titles. The breakdown of how those titles compared is shared in Table 11-2.
| Reason the titles differ | Total occurrences | Percentage |
|---|---|---|
| “A” | 5 | 3.50% |
| Actually different | 9 | 6.29% |
| Same, but different languages | 1 | 0.70% |
| “The” | 36 | 25.17% |
| (year) | 92 | 64.34% |
The MovieLens data source indicates that MovieLens augmented the titles of movies to contain the release year when that information was available. Therefore, we would expect to see that many of the clashes in titles are due to how this data was prepared. Table 11-2 confirms this, with 64% of the mismatched movies between the two sources having titles where the MovieLens title has the (year) but the Kaggle title does not.
The remaining reasons the titles were different between the MovieLens and Kaggle datasets are fairly interesting:
3.5% of the time, one title had the word “A” in it, whereas the other title did not.
25.1% of the time, one title had the word “The” in it, whereas the other did not.
There was one occurrence in which the titles were the same but in different languages: The Promise (English) versus La Promesse (French).
There were nine occurrences of the two titles actually being different.
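As a hedged sketch of how you might bucket these title differences programmatically, the normalization rules below are ours (matching the categories in Table 11-2), not the exact logic we ran for the book:

```python
import re

def normalize(title: str) -> str:
    t = re.sub(r"\s*\(\d{4}\)\s*$", "", title.strip())        # drop a trailing "(year)"
    t = re.sub(r"^(?:the|a)\s+", "", t, flags=re.IGNORECASE)  # drop a leading article
    return t.casefold()

def title_difference(ml_title: str, kaggle_title: str) -> str:
    if ml_title == kaggle_title:
        return "identical"
    if normalize(ml_title) == normalize(kaggle_title):
        return "formatting only: (year) or a leading article"
    return "actually different (or a different language)"

print(title_difference("Toy Story (1995)", "Toy Story"))  # formatting only
```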
The analysis of differing titles did not go far enough to say whether or not the movies were the same.
For 10% of these mismatched movies, which is 15 movies, we looked at their movie details in TMDB and IMDB to see which source had the correct information. From this deeper analysis, we found that the Kaggle data source had the correct TMDB and IMDB identifiers in all of the cases we investigated. The details of our in-depth study of mismatched movies are:
In 12 out of the 15 cases, the MovieLens data contained a TMDB identifier that pointed to a web page that had been removed.
15 of the 143 incorrectly matched movies had the correct information in the Kaggle data source based on crawling the original sources at TMDB and IMDB.
For all of the incorrect mappings that we deeply investigated, the MovieLens data source never had the correct information.
As a result, for all of the 143 occurrences in which the strong identifiers did not match up, we relied on the information from the Kaggle data source. That is, our final resolved errors contained 143 more false positives where the MovieLens data incorrectly linked a TMDB identifier to an IMDB identifier.
After we finished the resolution process, there was a total of 329,469 movies in our merged database. Some additional statistics about the merged dataset are:
There are 26,853 movies that are in both the MovieLens and Kaggle data sources.
There are 78,480 movies in our merged database with no IMDB identifier.
There are 237 movies in our merged database with no TMDB identifier.
We hope you found the details on how we merged these datasets to be illustrative and representative of the not-so-glamorous process of merging datasets. It is a common first step that every team has to go through before it can get started with using its data in a graph.
Which does raise the question: how could a graph help resolve our movie data?
While we are talking about resolving the false positives in the movie data, there is one area in which we could use edges in our data to resolve some of the false positives. Let’s take a look at a specific case.
If we had the actors from the MovieLens source, we could (hypothetically) use graph structure to help resolve some of our false positives. For instance, consider the two movies listed in Table 11-3 that are false positives from the MovieLens data.
| movie_id | imdb_id | tmdb_id | movie_title |
|---|---|---|---|
| 8795 | 0275083 | 23305 | The Warrior (2001) |
| 27528 | 0295682 | 23305 | The Warrior (2001) |
Table 11-3 shows all of the information that we have about these two movies. And from the data we have, we cannot confidently conclude whether these are or are not the same movie. The TMDB identifiers are the same, but the IMDB identifiers are different. However, the titles are identical.
The data we have just isn’t enough to make a conclusive decision. So let’s see what we can figure out about these two movies so we can come to a conclusion.
After doing some deeper digging, we could use the IMDB data to get the actors for each of these movies. Given the actor information about each movie, Figure 11-13 displays what each of their graphs would look like.
By resolving actors and creating relationships from movies to their actors, we can see that these movies are actually distinct and different movies. They have no actors in common between their cast lists (though we are showing just the first three actors in each cast list in Figure 11-13).
Figure 11-13 gives you an idea of when using edges from a graph can help you discover whether or not the data you have is distinct.
The lesson of the simple entity resolution example in this chapter is that the majority of your tasks in entity resolution do not require a graph structure. Well-defined processes start with following exact matches of strong identifiers. In cases when strong identifiers are not enough, you can rely on character edit distances for the next most important keys and values about your data.
Then, after you have covered the basics, and if relationships make sense in your data, you may want to bring relationships into your entity resolution process.
Figure 11-13 illustrates a compelling reason to use graph structure for entity resolution in our example (after we resolved strong identifiers and names with edit distances!) because you can immediately see that the movies are different. And you can infer why they are different. Although we certainly can’t do this kind of analysis for all problems, a graph can be a far more useful tool to add into your entity resolution process than digging deeper into tabular information to sort out the answer.
The ability to use a graph to resolve and merge data is a multifaceted problem. Elaborating on the full details of where, when, and how to use a graph for generalized entity resolution would fill a whole book.
From here, let’s get back to how we can deliver these recommendations at scale within a production application.
1 F. Maxwell Harper and Joseph A. Konstan, “The MovieLens Datasets: History and Context,” ACM Transactions on Interactive Intelligent Systems (TiiS) 5, no. 4 (2016): 19, https://doi.org/10.1145/2827872; Stephane Rappeneau, “350 000+ movies from themoviedb.org,” Kaggle, 19 July 2016, https://www.kaggle.com/stephanerappeneau/350-000-movies-from-themoviedborg.
2 Jesse Vig, Shilad Sen, and John Riedl, “The Tag Genome: Encoding Community Knowledge to Support Novel Interaction,” ACM Transactions on Interactive Intelligent Systems (TiiS) 2, no. 3 (2012): 13, http://doi.acm.org/10.1145/2362394.2362395.
Pretty much every application you use these days has a “recommended for you” section.
Just think about your favorite applications for digital media, apparel, or retail providers. We rely on the recommendation pane in our media apps to find new movies to watch or books to read. Brands like Nike tailor your in-app experience with personal and customized wardrobes. Even your local grocery store’s app delivers recommended coupons to you for your next visit.
Recommendations and personalization have infiltrated almost every nook and cranny of our digital experience.
But how do you build a process that delivers recommendations within an application at the speed that we have all learned to expect?
As we walked through in Chapter 10, it is very possible to connect data sources with a graph and create personalized recommendations for a user. However, the sheer amount of data that is required to process a graph-based recommendation at scale significantly limits how you would use collaborative filtering within a production application.
We don’t think a user of Nike’s apparel app is going to wait the multiple seconds required to process an end-to-end NPS-inspired collaborative-filtering graph query. And neither should you.
Instead, we encourage you to think like a production engineer. We want to set up procedures that prioritize the end user’s in-app experience and then figure out how to connect a longer-running query, like a graph-based collaborative filter, with a process that can guarantee your end user receives recommended content within web response time.
The focus of this chapter is just that: teaching you how to break down a complex graph problem into a piece that can be queried in real time versus a piece that requires a batch process.
There are four main sections to this final chapter.
We will start by explaining shortcut edges. We will show you why our development process doesn’t scale and how shortcut edges solve our problem. We will also talk about different ways to use shortcut edges with your data, with different pruning techniques.
In the next section we’ll explain how we precomputed shortcut edges for our movie data. We will be diving into data parallelism and the different operational challenges you will face when integrating longer-running calculations to be used in a transactional query.
Our third section will introduce the final production schema we used for our movie data. We will walk through the schema code and how to load the edges we computed, as you have done many times already.
In the last section we will show you how to use the shortcut edges to deliver recommendations to your end users. We will dig deeply into the partitioning strategies within Apache Cassandra so that you can reason about the latencies for different types of recommendation queries with our data.
We left off our discussion of recommendations in Chapter 10 with a graph query that performed collaborative filtering on our graph data. We created and computed an NPS-inspired metric to figure out which movie we should recommend according to the movies rated by one of our users. Figure 12-1 illustrates the general concept behind the approach we built.
Figure 12-1 shows how we walked through our development graph data from the left to the right to find recommendations. If you were following along in the notebook, you likely noticed that the overall processing time for these queries is not going to cut it if you want to use this approach in a production application. The user will end up waiting way too long to get their recommendations because the query takes too long to process.
Let’s dig into why and then how we will resolve the issues.
The reasons our development graph queries won’t scale are simple to state: branching factor and supernodes. If you think like us, you’ll agree that the appropriate response to having to deal with both of these problems at the same time is a very sarcastic “great.”
However, if you recall, we have run into issues with your graph’s branching factor and supernodes before.
We first ran into branching factor in Chapter 6 within our sensor network when we tried to walk from a tower down to all sensors. The branching factor of the edges in our data created exponential growth in our processing overhead.
The same branching problem exists within the general class of recommendation problems. As you walk from a user to movies to users to movies, your queries fork an exponential number of traversers in order to process all of the edges within the data.
We also have to deal with supernodes in our collaborative-filtering queries. Supernodes are very closely related to branching factor: supernodes represent the extreme end of your graph’s branching factor, as they are the highest-degree vertices.
We first experienced supernodes in Chapter 9 as we created filters and optimizations for pathfinding. We specifically eliminated high-degree vertices from our pathfinding queries because they (usually) do not provide meaningful results in pathfinding applications.
We are going to have to deal with supernodes differently in our recommendation data.
In recommendation problems, we have two types of supernodes: the superuser and the superpopular content. A superuser is a member of your platform who has viewed or rated almost every piece of content. Any time that user is discovered during your collaborative-filtering query, a large number of movies are inserted into your result set. There is also very popular content that is viewed or rated by most users on your platform.
Unlike how we dealt with these supernodes in pathfinding, in a recommendation system you want to account for this type of popularity in your algorithms because it can indicate trending or highly probable recommendations.
So how do we get around both of these problems? We build connections around them.
We have one last production trick to teach you: the shortcut edge. Shortcut edges are one of the most popular tricks used by teams around the world to mitigate the combined risk of your graph’s branching factor and supernodes in production queries.
A shortcut edge contains the precomputed results of a multihop query from vertex a to vertex n, stored as an edge directly from a to n.
Let’s look at how we will be using shortcut edges in our example in this chapter. Figure 12-2 shows how we will use an edge called recommend to directly connect movies to their recommendations according to the NPS-inspired metric of the user ratings in the middle.
The recommend edge essentially builds a bridge over the riskiest part of our collaborative-filtering query to ensure that the end user in our application does not have to wait.
You may be thinking, or debating with your team, about why we are not building the recommendation edge directly from our user to the content. Technically, that is a viable option. However, we are taking a different approach because we want to be able to provide an immediate recommendation for a user’s most recent rating.
To get a conceptual understanding of shortcut edges, let’s delve into how we want to use them.
Thinking through what you need to be able to query in production helps you to define boundaries on the complex problem of precalculating shortcut edges. We created Figure 12-3 to illustrate what we aim to make possible in our final example.
Figure 12-3 shows a conceptual model of using shortcut edges in the final query for the production version of movie recommendations. Specifically, we want to follow a user’s most recent rating to generate the highest-ranked set of movie recommendations. To do this, we need to precompute a shortcut edge called recommend that connects a user’s most recent movie rating to the new content we would recommend to the user.
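At request time, the production query then reduces to two cheap lookups. A sketch, with in-memory dictionaries standing in for the graph partitions we build later in this chapter:

```python
def recommendations_for(user_id, latest_rated_movie, recommend_map, k=10):
    """Follow the user's most recent rating to its precomputed shortcut edges."""
    movie_id = latest_rated_movie[user_id]      # hop 1: most recent rated movie
    ranked = recommend_map.get(movie_id, [])    # hop 2: precomputed recommend edges,
    return ranked[:k]                           # assumed already sorted by score
```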
The last topic we would like to address in this section is the different ways to prune and calculate shortcut edges with your data.
The main tricks when using shortcut edges come down to what and how often you precompute the aggregations they hop over. Let’s talk about the techniques that we recommend you consider for your application. We will point out the decisions we made for our data along the way.
When you are first exploring how to use shortcut edges, your team will want to discuss the limitations you will build into the computational process. Typically, there are three ways to limit the total amount of data that is considered for a shortcut edge: by total score, by total number of results, and by domain knowledge expectations.
Let’s briefly discuss what we mean by each of these options.
The first way you can filter out shortcut edges is with a predefined score threshold. In this approach, you would only include a shortcut edge for a recommendation if the score you calculate is above some threshold.
You already worked through the idea of using a hard threshold in a result set in this book. We walked through the use of a specific threshold when we defined the inflection point for trust in Chapter 9. We derived the specific point at which a weight above a threshold meant trust for our paths. This point is a mathematically derived limit above which a path is a trusted path for the application and below which it is distrusted.
For your recommendations and our movie data, you are not going to have such a defined fixed point.
If you want to go down this route, you will need to analyze recommendation scores for your data to understand whether a certain range of values is preferred by your user base. Regrettably, for our movie data, we do not have a specific threshold to give you when it comes to our NPS-inspired metric. But it is still important for you and your team to consider this option for your data.
In the absence of a mathematical threshold (or in combination with one), you can also limit the number of shortcut edges you include in your results.
The second way you can limit your use of shortcut edges is by defining the total number of edges you are going to include in production. Making the decision to store only 100 shortcut edges is an example of a hard limit. Your team can choose to include the 100 highest scoring recommendations, or you can include a selection from a range of scores.
A hard limit on the total number of edges is probably more popular than a specific score threshold for two reasons. First, it is easier to reason about the effects of the hard limit on your production application. With a hard limit, you can calculate the total amount of disk space required to store and maintain this data in production. Second, you can reasonably use them in your production query by selecting the most popular recommendation for your user. Or you can select the most highly recommended content that your user hasn’t previously watched.
We are going to use the hard-limit technique for the shortcut edges we calculate for our movie data.
Once you have developed and deployed your process for shortcut edges, there is one more concept to consider to make your recommendations more relevant to your users.
A filter on a movie’s genre to tailor recommendations to your user’s preferences is an example of how to use domain knowledge to prune your recommendations.
Essentially, if your user likes dramas, you will want to also include that type of filter as you recommend new movies.
There are a myriad of topics you could explore for how domain knowledge tailors the recommendations for our movie data. The most popular ways you already experience this when you use Netflix are the recommendations by genre, specific actors, or current trends.
Ultimately, filtering your recommendations according to domain knowledge is something to plan for in your application. The application of domain knowledge filters will eventually become a component of your application as it evolves.
We will get you started with the basics first, so that you can focus on what it takes to get into production.
When planning for delivering recommendations in production, you also need to consider how often you are going to update your shortcut edges. And your team needs to figure out how to design your pipelines to finish the computations in good time.
Consider your own experience with recommendation windows, like the “Recommended for you” section on Netflix. How often do you log in to your account and see a refreshed list of movie recommendations? Can you tell when the section was updated due to your recent viewing history? These questions and considerations are what we mean when we mention figuring out how often to update your recommendations to provide a better user experience.
Computing a shortcut edge across all of your users’ ratings is very expensive. You are going to have to reduce the scope of how often your team does these calculations. There are three tips that we are going to recommend you discuss with your team when you are designing how you build shortcut edge procedures in your application:
Updating the shortcut edges only for the content that has changed
Building data pipelines that account for successful recommendations
Creating robust computational processes
我们将简要描述每个提示的含义。
We are going to briefly describe what we mean for each of those tips.
首先,平台上的所有内容并非每天都会被查看或评分,甚至每周也不会。因此,您无需为整个图重新计算捷径边。您的团队将希望找到一种方法,只为最新数据构建捷径边,以跟上趋势。
First, not all content on your platform is going to be viewed or rated every day or even every week. Therefore, you do not need to recompute shortcut edges for your entire graph. Your team will want to find a way to build shortcut edges only for the most updated data to keep up with what is trending.
其次,您需要考虑您的用户群实际点击了哪些推荐,并利用这些信息帮助确定需要重新计算的推荐类别。今天,您可以在应用程序的“流行趋势”部分中体验到这一点。应用程序中的这些信号是需要捕获的一些最重要的功能,因为它们代表了了解用户当前喜欢什么的成功事件。与您的团队一起,规划如何捕获成功的推荐,并利用这些信息来解释当前的趋势。
Second, you will want to consider what recommendations your user base actually clicks on and use this information to help identify what category of recommendations you need to recalculate. Today, you experience this within the “what’s trending” section of your applications. These signals in your application are some of the most important features to capture because they represent a successful event for learning what your users like right now. With your team, plan how you are going to capture successful recommendations and use that information to account for current trends.
The last topic to consider focuses on building robust computational processes. When we say “robust,” we are talking about breaking down your problem into smaller, deterministic calculations that are easily repeatable. Using smaller and more local calculations, instead of larger global calculations, allows your team to have a more agile and fault-tolerant data pipeline.
Next, let’s walk through how we computed the shortcut edges for our example.
Shortcut edges help you get around your graph’s branching factor and supernodes at query time.
What you can’t get around is the amount of time that it takes to precompute shortcut edges.
For our movie data, we set up a separate environment to precompute the shortcut edges for you. This section walks you through what we did and why we made those decisions. Admittedly, there are many different ways to set up offline or batch processes to add features to your production data.
We are showing you one such approach knowing that it might not be ideal for all situations. We are going to point out different approaches and the trade-offs involved at the end of this section.
We found that the schema and query we built up in Chapter 10 were good enough for calculating our shortcut edges.
Each query just needed more time to process all of the data.
Therefore, we broke down the process for computing shortcut edges into the following three steps:
Figure out the schema needed to use the NPS-inspired metric in a production graph.
Use the final query from Chapter 10 to create a list of shortcut edges for one movie.
Divide up the work to calculate shortcut edges with basic parallelism.
Let’s start by taking a look at the schema we used in this environment.
Regardless of the metric, our collaborative-filtering query simply needs movies, users, and the rated edge. There are two requirements for the rated edge. First, we need it to be sorted by the rating so that we can group the edges according to their rating. Second, we need to be able to traverse the edge in both directions.
These requirements give us the schema in Figure 12-4.
The data model in Figure 12-4 describes the entire graph we constructed and loaded into the separate environment. We loaded only the movie and user vertices. We created one edge, rated, with a clustering key of the rating. Then we added a materialized view so that we could use the edge in the reverse direction in our collaborative-filtering query.
The vertex and edge labels for Figure 12-4 are as follows:
schema.vertexLabel("Movie").
       ifNotExists().
       partitionBy("movie_id", Bigint).
       property("tmdb_id", Text).
       property("imdb_id", Text).
       property("movie_title", Text).
       property("release_date", Text).
       property("production_company", Text).
       property("overview", Text).
       property("popularity", Double).
       property("budget", Bigint).
       property("revenue", Bigint).
       create();

schema.vertexLabel("User").
       ifNotExists().
       partitionBy("user_id", Int).
       property("user_name", Text).   // Augmented, Random Data
       create();

schema.edgeLabel("rated").
       ifNotExists().
       from("User").to("Movie").
       clusterBy("rating", Double).
       property("timestamp", Text).
       create()
Figure 12-4 shows one bidirectional edge, or an edge that needs a materialized view. The code is as follows:
schema.edgeLabel("rated").
       from("User").to("Movie").
       materializedView("User__rated__Movie_by_Movie_movie_id_rating").
       ifNotExists().
       inverse().
       clusterBy("rating", Asc).
       create()
Next, let’s use this data model to calculate our shortcut edges.
Given our schema, the next step is to outline how we are going to use the work we did building queries on the graph data in Chapter 10.
We lifted the query we developed for the NPS-inspired metric and made three modifications to it:
We wanted to start on a movie instead of a person.
We wanted to limit to the 1,000 highest scoring results.
We needed to create a list in which each entry has the original movie, recommended movie, and NPS-inspired metric.
We used the Gremlin query that we developed in Chapter 10 with these three small adjustments. We will show you where the three modifications are after the code. Further, Example 12-1 shows the query we used to calculate 1,000 shortcut edges for a given movie. We selected 1,000 both to satisfy our upcoming queries and to provide you with an interesting set of edges to explore if you choose.
1  g.withSack(0.0).                                 // starting score: 0.0
2    V().has("Movie","movie_id",movie_id).          // locate one movie
3    aggregate("originalMovie").                    // save as "originalMovie"
4    inE("rated").has("rating",P.gte(4.5)).outV().  // all users who rated it 4.5+
5    outE("rated").                                 // movies rated by those users
6    choose(values("rating").is(P.gte(4.5)),        // is the rating >= 4.5?
7           sack(sum).by(constant(1.0)),            // if true, add 1 to the sack
8           sack(minus).by(constant(1.0))).         // else, subtract 1
9    inV().                                         // move to movies
10   where(without("originalMovie")).               // remove the original
11   group().                                       // create a group
12     by().                                        // keys: movie vertices; will merge duplicate traversers
13     by(sack().sum()).                            // values: will sum the sacks from duplicate traversers
14   unfold().                                      // populate every entry from the map into the pipeline
15   order().                                       // order the whole pipeline
16     by(values,desc).                             // by the values for the individual map entries
17   limit(1000).                                   // take the first 1000 results, which will be the top 1000
18   project("original","recommendation","score").  // structure your results
19     by(select("originalMovie")).                 // "original": original movie
20     by(select(keys)).                            // "recommendation": rec movie
21     by(select(values)).                          // "score": sum of NPS metrics
22   toList()                                       // wrap the results in a list
Let's point out the three places in Example 12-1 where we changed the query we developed in Chapter 10. First, line 2 in Example 12-1 shows how we started at a specific movie. Then we followed the same process for performing collaborative filtering and calculating an NPS-inspired metric from line 3 to line 16.
The last two changes in Example 12-1 are about formatting the results for later use. Line 17 shows our second change: we reduced the total number of results to include only the 1,000 highest scoring recommendations. This limitation is vital because this type of approach will eventually compute an edge for all 327,000+ other movies in the database as you collect enough ratings. Then lines 18 through 22 in Example 12-1 show how we formatted our results to make it easy to save our work into our production environment. We created a list that had 1,000 entries with this structure: original movie, recommended movie, and then the NPS.
To give you an idea of what the results looked like, Figure 12-5 shows the top six recommendations for one of our movies.
The original movie in Figure 12-5 is movie 588, Aladdin. The top five recommendations that we computed and saved as shortcut edges included The Lion King, The Shawshank Redemption, Beauty and the Beast, Forrest Gump, and Toy Story.
Now that you see the schema and query that we used, let’s talk about how we divided up the work to get it done.
We opted to decompose the larger process of precalculating shortcut edges for our entire graph into smaller, independent problems. We can break down movie recommendations into many smaller queries because each movie’s set of recommendations is independent from any other movie’s set.
Example 12-2 outlines how we divided up computing shortcut edges for our data into many smaller, independent queries.
1 SETUP:         Load the users, movies, and ratings graph into a separate environment
2 DECOMPOSE:     Divide the movie_ids into N smaller, independent lists
3 ASSIGNMENT:    Assign one list per processor
4 ORCHESTRATION: Synchronously compute shortcut edges for each movie
5 EXTRACTION:    Save the results to be loaded into the production graph
The approach described in Example 12-2 employs a straightforward and basic way to divide up the work required to calculate shortcut edges for our movies. We started by setting up a separate environment and loading just the users, movies, and ratings into that environment. Then we divided a list of movie_ids into N separate and independent groups. We assigned each list to a separate process so that we could use basic parallelism to compute the shortcut edges for an individual movie, synchronously. Last, we saved the results into a list that we could then load into our production model.
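To make those five steps concrete, here is a minimal Groovy sketch of that orchestration. The computeShortcutEdges(movieId) helper is hypothetical, standing in for the traversal in Example 12-1, the movie_id list is a stand-in, and N = 8 is an arbitrary choice; the rest is standard java.util.concurrent plumbing.

import java.util.concurrent.Callable
import java.util.concurrent.Executors

// Hypothetical helper that wraps the traversal from Example 12-1.
def computeShortcutEdges(movieId) { return [] }

def movieIds = (1L..10000L).toList()            // stand-in for the real movie_id list
int N = 8                                       // ASSIGNMENT: one list per processor
def chunks = movieIds.collate((movieIds.size() + N - 1).intdiv(N))   // DECOMPOSE
def pool = Executors.newFixedThreadPool(N)
def futures = chunks.collect { chunk ->         // ORCHESTRATION
    pool.submit({ chunk.collectMany { id -> computeShortcutEdges(id) } } as Callable)
}
def allEdges = futures.collectMany { it.get() } // EXTRACTION: gather every edge list
pool.shutdown()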
Breaking up the calculation of shortcut edges into many smaller queries follows a process known as data parallelism. You can use data parallelism when the same computation needs to be performed on different subsets of the same data. Essentially, you use the same model for each thread in your computing environment, but the data given to each of them is divided and shared. We recommend Vipin Kumar’s book on the topic if you want to learn more.1
The approach we outlined in Example 12-2 prioritizes the minimization of computation time over memory. We can break down movie recommendations into many smaller queries because each movie’s set of recommendations is independent. Some complex problems, like PageRank, cannot be decomposed in this same manner.
Before we can show you how to use this in your production graph, we need to have a brief side discussion about another very common way to solve this same problem.
We had to make some trade-offs when deciding how we were going to approach the computation of shortcut edges for this book. As you just learned, we decided to use basic parallelism to divide up the work and compute shortcut edges for each individual movie independently.
However, the Gremlin query language also has a batch execution model that is primarily used for larger batch computations across large portions of the graph.
So why didn’t we use batch computation to precompute the shortcut edges?
For this book, the reason is primarily one of scope. Using batch computation introduces enough depth and complexity to fill another book. So we are just providing a teaser here for batch graph queries and will leave you with the exciting realization that there is more to learn about applying graph thinking than we were able to cover in this book.
If you are deciding between parallelized transactional queries and batch computation for your shortcut edge computation, here are some of the trade-offs that you should consider.
Gremlin queries that execute batch computations can exploit shared computations. This can result in quicker overall execution because of not having to traverse the same parts of the graph twice. For instance, in our example we have a lot of movies that were rated by the same reviewer. With transactional queries, we traverse through those reviewer vertices multiple times. With batch computation, we can bulk those computations together.
Batch computation usually requires more resources (in particular memory), which can interfere with a concurrent transactional workload. For instance, in our example a concurrent batch query may put enough stress on the database to delay concurrently running recommendation retrieval queries. This pressure could lead to longer latencies and a worse user experience. For that reason, batch computations are often started either when there is little load on the database or in a separate data center and Cassandra cluster, but that may not be an option for you.
With DataStax Graph, batch computations can be done in an analytical data center (of the same cluster). Then, once precomputed results are written back into the graph, they are also automatically replicated to the operational data center. This is how workload separation works with Apache Cassandra and DataStax Graph.
With batch computation, you are always recomputing all shortcut edges, which gives that approach the computational advantage. But in many cases, you may want to update some shortcut edges more frequently than others and need the flexibility that the transactional approach provides.
Transactional queries allow for more selective updates of precomputed edges.
For instance, in our example the shortcut edges for recent movies are becoming stale more quickly as new ratings start pouring in. In that scenario, we would want to recompute those edges more frequently than we would for old movies, where there is little to no change. For a second scenario, maybe your precalculation job fails and you have to start over. Using smaller transactional queries would be easier to track and restart than having to recompute the whole graph.
The transactional approach of data parallelism works better if there are fewer starting points. In our example, we have thousands of movies to start from, which is a rather small number. If that number were in the millions, then the transactional approach would take a long time (and be pretty error-prone), which would favor the batch approach.
There are other trade-offs that depend on your particular situation, environment, and infrastructure, but those are the main ones to consider in making this decision.
We chose the data parallelism approach with transactional queries to show you how we reasoned about calculating shortcut edges for this example. It isn’t necessarily the best approach for all situations. You will need to consider your environment and your application’s expectations when determining how to set up precomputing shortcut edges for your next project.
Now that you have an idea of how we computed the shortcut edges, let’s show the recommendation data model that we used in production, load the data, and walk through our queries one last time.
Recall a few pages back when we showed you our conceptual design for how we wanted to deliver our recommendations. Figure 12-3 showed how we wanted to query a user’s most recent rating and then use the top recommendations to provide new content to our user. The details we walked through in that image give us the outline for how we want to deliver recommendations in our environment.
Let’s walk through the schema we will use and the final data loading processes for this last example.
One of the most important details of our conceptual visualization in Figure 12-3 was shown underneath the edges. We saw that we wanted to use a user’s most recent rating to deliver the three highest scoring recommendations. These constraints describe how we can cluster our edges most optimally for performance.
Figure 12-6 is the final schema model that we will use for our recommendations. We will use the same two vertex labels as we had for our shortcut edges: users and movies.
The differences in the two models in this chapter are the edge labels. The user’s ratings will be clustered by time so that we have easy access to the most recent rating. Then we will use the shortcut edges we precomputed in “Calculating Shortcut Edges for Our Movie Data” to directly recommend movies from a given movie. The shortcut edges will be stored as the recommend edge, sorted by their rating.
Example 12-3 shows the vertex labels, and Example 12-4 shows the edge labels.
schema.vertexLabel("Movie").
       ifNotExists().
       partitionBy("movie_id", Bigint).
       property("tmdb_id", Text).
       property("imdb_id", Text).
       property("movie_title", Text).
       property("release_date", Text).
       property("production_company", Text).
       property("overview", Text).
       property("popularity", Double).
       property("budget", Bigint).
       property("revenue", Bigint).
       create();

schema.vertexLabel("User").
       ifNotExists().
       partitionBy("user_id", Int).
       property("user_name", Text).   // Augmented, Random Data
       create();

schema.edgeLabel("rated").
       ifNotExists().
       from("User").to("Movie").
       clusterBy("timestamp", Text, Desc).   // Note: changed clustering key
       property("rating", Text).
       create()

schema.edgeLabel("recommend").
       ifNotExists().
       from("Movie").to("Movie").
       clusterBy("nps_score", Double, Desc).
       create()
The last part of our setup is to walk through how to load the data.
The user and movie vertices will be loaded the same way as we walked through in Chapter 10. We are going to skip that part of the loading process because it is exactly the same, with the same files.
The only new data to load is the shortcut edges for the recommend edge label. We created a csv file of all of the precomputed edges so that we can load them easily into our graph for our final production recommendation queries.
We created a file to load with all of the precomputed shortcut edges in this chapter. The file structure is shown in Table 12-1.
| out_movie_id | movie_id | nps_score |
|---|---|---|
| 588 | 364 | 4911.0 |
| 588 | 318 | 4697.0 |
| 588 | 595 | 4624.0 |
| 588 | 356 | 4310.0 |
| 588 | 1 | 4186.0 |
| 588 | 593 | 3734.0 |
As you have seen many times in this book, the hardest part of structuring your edge files is making sure your data, header, and graph schema all line up. The header line of Table 12-1 shows that the first movie_id on each line corresponds to the out_movie_id. The out_movie_id is the movie for which we computed a recommendation. The second movie_id on each line corresponds to the in_movie_id. This second identifier connects the edge to the recommended movie. The last piece of data on each line is the nps_score for the recommendation that we already computed for you using the NPS-inspired collaborative-filtering approach.
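In case you are curious how a result list like the one from Example 12-1 becomes that file, here is one hedged way to write it in Groovy. The results variable refers to the list returned by Example 12-1, and the header mirrors Table 12-1; the exact column naming in our original pipeline is an assumption.

// A sketch: flatten the projected results from Example 12-1 into the CSV.
def csv = new File("short_cut_edges.csv")
csv.text = "out_movie_id,movie_id,nps_score\n"                  // header from Table 12-1
results.each { row ->                                           // keys: original, recommendation, score
    def outId = row["original"].values("movie_id").next()       // the movie we started from
    def inId  = row["recommendation"].values("movie_id").next() // the recommended movie
    csv << "${outId},${inId},${row['score']}\n"
}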
If you want to confirm that the header, data, and schema all align, you can inspect the schema of the recommend edge label’s table definition.
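One way to do that from the Gremlin console is the schema API's describe helper; we believe DataStax Graph exposes it as shown below, though the exact output format varies by version.

// Prints the full schema, including the recommend edge label's
// clustering key, so you can check it against the CSV header.
schema.describe()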
The final step is to load the shortcut edges into your graph. Example 12-5 shows the command that loads the data using the bulk loading tool.
dsbulk load -g movies_prod
            -e recommend
            -from Movie
            -to Movie
            -url "short_cut_edges.csv"
            -header true
Let’s walk through the final version of our recommendation queries to see how we will use these shortcut edges to deliver recommendations in a few steps.
We designed our schema and precomputed our shortcut edges so that we could deliver our recommendations to our end user as quickly as possible. Going through all of the work ahead of time ensures that your application delivers the fastest and best user experience.
In this last section, we want to do three things. First, we want to check that our loaded shortcut edges match what we computed during our offline process. The second section shows you how to use these shortcut edges in three different styles of recommendation queries. The last section shows you how to reason about query performance by mapping the number of edge partitions accessed during two of our three production queries.
Let’s first confirm that our shortcut edges match the computation we showed you in “Calculating Shortcut Edges for Our Movie Data”.
We took a snapshot of the shortcut edges we computed for the movie Aladdin in “Calculating Shortcut Edges for Our Movie Data”. We know that Aladdin’s movie_id is 588, so in Example 12-6, let’s query for Aladdin’s top five recommendations to make sure they match our expectations.
1 g.V().has("Movie","movie_id",588).as("original_movie").
2   outE("recommend").
3   limit(5).
4   project("Original","Recommendation","Score").
5     by(select("original_movie").values("movie_title")).
6     by(inV().values("movie_title")).
7     by(values("nps_score"))
Example 12-6 applies the Gremlin patterns we have used throughout this book. Line 1 starts with a partition key lookup to Aladdin’s movie vertex. Then we walk to the recommend edges on line 2.
Line 3 of Example 12-6 brings together two very important concepts: clustering keys in Apache Cassandra and limits in Gremlin. Recall that the recommend edges are clustered by their rating. Therefore, the use of limit(5) on the edge table finds the top five recommendations according to their rating score, because it selects the first five rows of the partition in the underlying tables. This is exactly why Gremlin with distributed adjacency lists in Apache Cassandra is so fast.
The remaining work in lines 4 through 7 of Example 12-6 nicely formats the results. We create a user-friendly structure that lists Aladdin, the title of the recommended movie, and its score. Figure 12-7 shows you what you will see in the accompanying Studio Notebook.
Figure 12-7 matches the expected top five recommendations for the Aladdin movie that we previewed in “Calculating Shortcut Edges for Our Movie Data”.
Let’s now use these edges to show you how to deliver recommendations to a specific user.
There are three queries we want to show you in this section. They are:
Query 1: The top three recommendations for the most recent rating by our user
Query 2: The top recommendation for the three most recent ratings by our user
Query 3: Combining 1 and 2 to deliver the top three recommendations for each of the three most recent ratings by our user
The first two queries take different approaches to providing three recommendations to the end user. The last approach was designed to show you how to get nine specific recommendations by using barrier steps in Gremlin.
We designed the schema, processes, and shortcut edges with one purpose in mind: to be able to immediately deliver new content according to a user’s most recent rating. We can do that by accessing our user’s most recently rated movie and then accessing our top three precomputed recommendations. Example 12-7 shows how to do this in Gremlin.
1 g.V().has("User","user_id",694).       // our user
2   outE("rated").limit(1).inV().        // first "rated" edge is most recent
3   outE("recommend").limit(3).          // first three are top 3 recommendations
4   project("Recommendation","Score").   // create a map with two keys
5     by(inV().values("movie_title")).   // move to the movies; get title
6     by(values("nps_score"))            // stay on edges; get score
The Gremlin steps in Example 12-7 are all ones that we have used before. The beauty of this example is shown on lines 2 and 3 with the use of limit(x) on the edges.
On line 2, recall that the rated edges are clustered by time. Therefore, outE("rated").limit(1) accesses the first edge in the partition, which is also the most recent rating.
This same access pattern is used on line 3 with outE("recommend").limit(3) because the recommend edges are sorted on disk by rating. Lines 4 through 6 use the project step to create user-friendly formats of the data. The results of this query are shown in Figure 12-8.
Figure 12-8 shows that the three newest movies to recommend to our user, according to their most recent movie rating, are Rear Window, Casablanca, and Dr. Strangelove. Interesting choices, user 694.
Seeing that the scores for these movies are relatively low, you may want to broaden the set of movies to recommend by considering more ratings. Let’s take a look at how to diversify your ratings with Query 2.
The goal of this next example is still to provide three recommendations to our user, but to find them a bit differently. We want to query our user’s three most recent ratings and provide the top recommendation for each of them. Example 12-8 shows how we will do this in Gremlin.
1  g.V().has("User","user_id",694).                           // our user
2    outE("rated").limit(3).inV().                            // three most recently rated movies
3    project("rated_movie","recommended_movie","nps_score").  // map w/ 3 keys
4      by("movie_title").                                     // value for the key "rated_movie"
5      by(outE("recommend").                                  // value for the key "recommended_movie"
6         limit(1).                                           // first recommendation is top rated
7         inV().values("movie_title")).                       // traverse to the movie and get title
8      by(outE("recommend").                                  // value for the key "nps_score"
9         limit(1).                                           // first recommendation is top rated
10        values("nps_score"))                                // stay on the edge; get the score
The beauty of Example 12-8 is that it shows you how to create more diversity in your recommendation set. On line 2, we use the fact that the rated edges are clustered by time, but this time we collect the three most recent ratings with limit(3). Then, for each of the three most recent ratings, we want to find the top recommended movie. This is what we do on lines 5 and 6. We limit each traverser to its top recommendation and then access the movie title. The rest of Example 12-8 formats different pieces of the query so that we collect a meaningful view of the result data. Figure 12-9 shows our results.
Figure 12-9 shows one of the recommendations from our first query, Rear Window. We now see that this movie is the top recommendation for a movie titled Safety Last! The other two movies that user 694 has recently rated are also shown in Figure 12-9 along with their top recommendation.
By broadening to a larger set of recent ratings for our user, we were able to find a fairly popular movie: Back to the Future. This movie has the highest NPS-inspired metric and therefore is also the most popular movie in our set of recommendations.
One last natural question to ask about this data would be to collect, say, the top three recommendations for each of our user’s three most recent ratings. For each rating, we will want to move to the set of recommended movies and ask each traverser to find three. We do this in Gremlin with the local() scope around the traversers. Then we want to bring all recommendations back together and merge them into one list. Let’s make that our last example for this dataset in Example 12-9.
1 g.V().has("User","user_id",694).       // our user
2   outE("rated").limit(3).inV().        // 3 most recent rated movies
3   local(outE("recommend").limit(3)).   // top 3 recommendations for each movie
4   group().                             // create a map
5     by(inV().values("movie_title")).   // keys for the map; merge duplicates
6     by(values("nps_score").sum()).     // values; sum values for duplicates
7   order(local).                        // sort the map
8     by(values,desc)                    // by its values, descending
Example 12-9 starts off the same as Example 12-8 but introduces the use of local scope on line 3. Using local scope ensures that each traverser grabs three recommendations to populate into the map we construct on line 4. Line 5 shows us that the keys of this map will be the movie titles. Line 6 aggregates all of their ratings into one value. Figure 12-10 shows the results.
Figure 12-10 shows the movies to recommend to our user along with each movie’s final score. It is interesting to note that we do not have nine total recommendations. Figure 12-10 has eight results because Raiders of the Lost Ark showed up as a recommendation for the movie Overboard and the movie Bill and Ted’s Excellent Adventure; the final score for Raiders of the Lost Ark aggregated the score from each recommendation.
We encourage you to play around with these queries in the accompanying notebook. Most notably, take a look at what happens when you query this data with and without fold(). How would you expect the structure of the results to change when you remove the barrier? Did you get it right?
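As a nudge for that exercise, here is our hedged sketch of the fold() variant (not the notebook's exact cell): appending fold() after the titles collapses all traversers into a single list, while removing it streams one result per traverser.

// With the barrier: one list of titles for the user.
g.V().has("User","user_id",694).
  outE("rated").limit(3).inV().
  local(outE("recommend").limit(3)).inV().
  values("movie_title").
  fold()                            // remove fold() to stream individual titles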
There are three additional topics you will want to consider for your application.
First, the most obvious filter you will want to include in your application would limit recommendations according to a user’s preference. Such a filter could remove movies they have already watched or have rated poorly.
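A hedged sketch of such a filter, layered onto Query 1, might look like the following. The over-fetch of 10 shortcut edges before trimming to three is an arbitrary buffer we chose so that filtering does not leave the user short.

g.V().has("User","user_id",694).
  sideEffect(out("rated").aggregate("seen")).  // remember every movie the user rated
  outE("rated").limit(1).inV().                // most recent rating, as in Query 1
  outE("recommend").limit(10).inV().           // over-fetch a few recommendations
  where(without("seen")).                      // drop movies already watched
  limit(3).                                    // deliver three unseen movies
  values("movie_title")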
The second topic to consider is the size of the result set. We arbitrarily chose three recommendations, but we precomputed 1,000 recommendations per movie. We used the smaller number in our examples to illustrate the concepts in a meaningful way. You can explore sampling the 1,000 edges to compare different ways to use them.
The most popular way to expand beyond the top three recommendations is also our last tip for your recommendation query. After your user views a small number of recommendations, you will want your application to keep scrolling and pull more data. You will want to set up your application to be able to stream more results to your end user, which is why you will likely need to go deeper into your set of shortcut edges.
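Gremlin's range(low, high) step is the natural building block for that kind of paging, because the recommend edges are already sorted on disk. Here is a hedged sketch of "page two" of Query 1:

g.V().has("User","user_id",694).
  outE("rated").limit(1).inV().    // most recent rating
  outE("recommend").range(3, 6).   // skip the first three shortcut edges
  project("Recommendation","Score").
    by(inV().values("movie_title")).
    by(values("nps_score"))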
Hopefully, throughout all of the exercises across this book, you have learned how to apply limits and filters to accommodate any of these options to the type of recommendations you deliver to your users.
The synthesis of limit(x) in Gremlin, combined with your graph schema’s distributed architecture, is one of the most important concepts in this book. This last section emphasizes that point to really bring it home.
The queries we introduced in the preceding section each provided a different set of recommendations to your end user. We saw that the results of Query 2 and Query 3 were more diverse than the results of Query 1.
Thus you may have concluded that queries like Query 2 or Query 3 are better for your application because they provide more selection for your user. And they may be.
But you have one last concept to synthesize in order to understand the performance trade-offs of Query 1, Query 2, and Query 3.
The performance implications for each of our queries come down to how many edge partitions are accessed to deliver the recommendations. And one of our queries has a significant advantage when it comes to performance.
Let’s synthesize the traversals we have shown here with the concepts we detailed in Chapter 5 by laying out the number of edge partitions required in each traversal. We start by showing the number of edge partitions required for our first query.
Figure 12-11 shows that we need to access two separate edge partitions to deliver our three recommendations.
The first edge partition is from user 694 to the movie vertex for Safety Last! The second edge partition is from the single movie vertex to its three top-rated movies. The tables drawn alongside the graph data in Figure 12-11 highlight exactly when different edge partitions are accessed during our traversal.
Let’s contrast the number of partitions required for Query 1 by looking at the number of edge partitions required for Query 2 in Figure 12-12.
Figure 12-12 shows that we need to access four separate edge partitions to deliver our three recommendations. The first partition is the same as we used before: the user’s ratings edges. However, this time we selected three separate movie vertices. To access the top recommendation for each movie, we have to look at three different partitions. Therefore, Query 2 requires four different edge partitions to find three different recommendations.
The last query to consider is Query 3, and like Figure 12-12, Figure 12-13 shows that we need to access four separate edge partitions in order to deliver our recommendations.
The first partition is the same as we used before: the user’s ratings edges. However, this time, we selected three separate movie vertices. To access the top three different recommendations for each of our movies, we have to look at three different partitions. Therefore, Query 3 requires four different edge partitions to find three different recommendations.
You also see in Figure 12-13 that the second and third movies both recommend Back to the Future. Look back to our results listed in Figure 12-10. We can see here that the score for Back to the Future aggregated both nps_scores: 691.0 + 52.0 = 737.0.
The key for understanding performance of a query in a distributed environment lies in synthesizing two concepts. As we just walked through, there is a direct correlation to a query’s speed according to the number of partitions it requires across a distributed environment.
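One practical way to check that reasoning against a live system is Gremlin's profile() step, which reports per-step counts and timings; the exact metrics shown depend on the implementation. For example, profiling Query 1:

g.V().has("User","user_id",694).
  outE("rated").limit(1).inV().
  outE("recommend").limit(3).
  profile()                     // per-step metrics for Query 1's two edge partitions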
With practice, it will become easier to follow the number of partitions accessed in your query. Continue thinking about and visualizing your queries like what we illustrated in Figure 12-13 to build up your understanding.
The second main contributing factor to your query’s performance is your data’s connectivity. We used shortcut edges in this chapter to simultaneously mitigate your data’s branching factor and potential supernodes.
We have constructed all the information in the past few sections so that you can reason about your query’s performance. Altogether, your graph query’s performance is an intricate balance between distributed partition management and planning for your data’s branching factor. At the end of the day, these are all the foundational concepts you need to practice in order to be able to reason about the performance of distributed graph queries. We hope you found them instructive.
1 Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing. 2nd ed. (Boston: Addison-Wesley Professional, 2003). https://www.oreilly.com/library/view/introduction-to-parallel/0201648652/.
We are incredibly honored that you went with us on this journey to graph thinking and its application to complex problems. You learned a new way of thinking for solving complex problems, a new body of theory to formalize that thinking, and a number of new techniques and technologies for applying that thinking in building practical solutions.
As Leonardo da Vinci said, “A developer would be overcome by sleep and hunger before being able to describe with words what a code sample can express in an instant.”
Like with all crafts, mastery of graph thinking can be gained through continued practice. We set up our notebooks and example problems to show you how to get started with your new craft. Feel free to keep playing with those notebooks and adjusting them to suit your particular problems.
We would like to encourage you to apply the frameworks from this book to the problems that you encounter. The first chapters of this book showed you how to reason about which problems benefit from graph thinking. The criteria we walked through aren’t hard and fast rules but simply rough guidelines for discerning when a problem has the characteristics that make it suitable for graph thinking. Over time, you will build an intuition that will support this decision making.
As you are starting out, it will likely feel more natural and comfortable to think about your data problems through the relational lens of tabular data. Push through the discomfort of adopting a different perspective and give graph thinking a try, especially when the relationships and connective structure of the data are important to the problem at hand.
There is nothing wrong with the relational perspective to representing data, and we are not trying to argue that graph thinking is better. It’s different, and for a certain class of problems, it’s easier and more effective for finding a solution. Mastering both perspectives is critical to solving complex problems since they often need to be broken down into subproblems that require a combination of both perspectives.
As you are starting to apply graph thinking to your problems, we encourage you to follow the “development first, production second” approach we took throughout the latter chapters of this book. In other words, start with exploring your data as a graph and quickly iterate toward applying and refining suitable graph techniques before you delve deep into fine-tuning those techniques for production use. Chapter 4 through Chapter 12 walked you through how we approach this mentality for applying graph thinking to the most common connected data problems: exploring neighborhoods, branching in trees, finding paths, collaborative filtering, and entity resolution.
Think of those techniques as Lego bricks that you can combine and assemble in various ways to build the solution that works for your particular application.
Is this all there is to know about graph thinking?
Far from it—this is just the end of the beginning.
Graph thinking is an incredibly rich topic, with relationships to many other areas in computer science, physics, mathematics, biology, and beyond. Once you become comfortable with viewing a problem through the lens of vertices connected by edges, you’ll be surprised by the depth of understanding that this change in perspective unlocks in various areas of human inquiry.
We recommend four avenues you can take to continue your journey with graph thinking: graph algorithms, distributed graphs, graph theory, and network theory. This list is by no means comprehensive, but just a rough outline of the many learning roads you can take from here.
We are going to end with a brief section on each of these four topics and our recommendations on what to read next.
There is another class of graph problems to mention: graph algorithms. Unlike the specific production traversals taught in this book, graph algorithms typically require analyzing the entire graph’s structure, like computing a specific analytic about the connectedness of your data.
Collaborative filtering, which we first saw in Chapter 10, is one example of a graph algorithm. Other popular graph algorithms are all-pairs shortest path, PageRank, graph coloring, connected components identification, betweenness centrality, graph partitioning, and modularity.
There are two main concepts to mention about graph algorithms.
The first point is that a graph algorithm typically requires global computation across most of the graph, if not the entire graph. We teased the introduction of batch computation as an alternative way to use the Gremlin query language for global computations on graph-structured data.
The second point is to acknowledge that some graph algorithms can be broken down into many localized computations, whereas others cannot. We saw one graph algorithm, namely collaborative filtering, that can be solved either with a global distributed computation or with many localized computations.
Most of the more popular global graph algorithms, like PageRank and Connected Components, cannot be decomposed into smaller computations and require distributed batch computation when applied to very large graphs. For this class of graph algorithms, it may be necessary to run the graph computation in a batch computing (or bulk synchronous) mode that distributes the computation across multiple machines in a cluster.
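To give a flavor of that mode, TinkerPop exposes bulk synchronous execution through the withComputer() traversal source. A rough sketch of a global PageRank over the movie vertices follows; the PageRank token and the property handling assume TinkerPop 3.4-style static imports, and the step modulators differ across versions, so treat this as an illustration rather than a recipe.

g.withComputer().                             // bulk synchronous (OLAP) mode
  V().hasLabel("Movie").
  pageRank().
    with(PageRank.propertyName, "pageRank").  // write each vertex's rank
  order().by("pageRank", desc).               // global barrier across the graph
  valueMap("movie_title", "pageRank").
  limit(5)                                    // five most central movies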
We recommend two books if you are interested in learning more about global graph algorithms. First, we recommend studying Distributed Graph Algorithms for Computer Networks by K. Erciyes (Springer) if you want to dive deeply into how and when graph algorithms can be broken down into smaller localized problems. Second, the hands-on practitioner may appreciate the code examples for the most popular algorithms in Graph Algorithms: Practical Examples in Apache Spark and Neo4j by Mark Needham and Amy E. Hodler (O'Reilly).
This book places an emphasis on distributed graphs. Graphs need to be distributed when they are too large to reasonably fit on a single machine, because of workload requirements (i.e., achieving a certain throughput at low latency), or to account for geo-distribution requirements of the data. Distributed graphs are particularly challenging because you are combining the complexity of distributed data with the complexity of graph thinking.
While DataStax Graph in Cassandra handles a lot of this complexity on behalf of the user, such as data replication and fault tolerance, understanding how distributed systems work in detail is critical to understanding the behavior of the systems under extreme conditions.
Some elements that we have not addressed in any detail in this book have to do with data consistency. DataStax Graph uses an eventual consistency model that favors system uptime over strong consistency guarantees. Other graph databases make the opposite choice and provide stronger consistency guarantees with a higher likelihood of database unavailability.
What is right for your application depends on your business requirements. In any case, it is important to understand what consistency and availability guarantees are being provided by the system and how to reason about them.
Distributed databases are fascinating systems, and their discussion fills entire books. We encourage our readers to learn more about them. To learn more about Cassandra, the distributed database underlying DataStax Graph, we recommend Cassandra: The Definitive Guide by Jeff Carpenter and Eben Hewitt (O’Reilly). For a more general discussion of distributed databases, we recommend the Principles of Distributed Database Systems textbook by M. Tamer Özsu and Patrick Valduriez (Springer Science+Business Media).
There is a whole branch of mathematics called graph theory, which is dedicated to the study of graph structures; many of the terms introduced in this book stem from graph theory. From a practitioner’s standpoint, it is most useful to familiarize yourself with the terminology and develop a basic understanding of the distinctions that graph theory draws.
If you’d like to get a deeper understanding of the terminology and basic concepts underlying graph thinking, we encourage you to study graph theory. Graph theory will teach you about certain classes of graphs, such as planar graphs, and what characteristics these graphs have. You’ll learn more about the famous “graph coloring” problem.
A good starting point for your self-directed tour of graph theory is Introduction to Graph Theory by Richard J. Trudeau (Dover). Also, you can find a lot of introductory material on graph theory online, including an entire YouTube channel by Sarada Herke dedicated to graph theory and discrete mathematics.
When searching for content online about graph theory, you will quickly run into the term network theory.
Network is a term used synonymously with graph, and network theory is the application of graph theory to the real world. Network theory studies natural graphs, or graph structures that occur in the real world around us and within different disciplines.
For example, sociologists apply network theory to study social networks and reason about natural connected structures. Biologists look at graphs that occur in the biological world, such as food networks (or “who eats whom?”), and within human beings, such as molecular pathways or protein-protein interaction networks.
One fascinating finding from network theory is that many naturally occurring networks are “scale-free”: the degree distribution of their vertices follows a power law. Simply put, a few vertices in the graph have a whole lot of edges, while very many vertices have only a few edges. Twitter is a good example of a scale-free graph: very few people on Twitter have millions of followers, while millions of people have only a few followers.
A wide variety of natural networks are scale-free; this is the reason that supernodes exist in graphs.
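To see such a degree distribution for yourself, the following sketch uses the networkx library to generate a synthetic scale-free graph with the Barabási–Albert model; the graph size and seed are arbitrary choices for this example:

import networkx as nx
from collections import Counter

# Generate a synthetic scale-free graph: 10,000 vertices, each new vertex
# attaching to 2 existing vertices.
g = nx.barabasi_albert_graph(n=10_000, m=2, seed=42)

degree_counts = Counter(d for _, d in g.degree())

# Print the low end of the distribution and the maximum degree.
for degree in sorted(degree_counts)[:5]:
    print(f"degree {degree}: {degree_counts[degree]} vertices")
print(f"maximum degree: {max(degree_counts)}")

Running this typically shows thousands of vertices with only a few edges and a small number of hubs whose degree is orders of magnitude higher: the supernodes of the synthetic graph.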
One popular theory that attempts to explain the prevalence of scale-free networks, the theory of preferential attachment, speculates that as new vertices join a network over time, they are more likely to build edges to vertices that already have many edges. In other words, it is a classic “the rich get richer” phenomenon.
This intuitively holds true for Twitter: if a new user joins Twitter, they are more likely to follow somebody popular like Barack Obama than a random person.
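To see the mechanism itself, here is a minimal from-scratch simulation of preferential attachment; this is a toy model, not the exact algorithm from the literature. The trick: sampling uniformly from a list of edge endpoints is equivalent to sampling vertices in proportion to their current degree.

import random
from collections import Counter

def preferential_attachment(num_vertices, seed=42):
    """Grow a graph where each new vertex attaches to one existing vertex,
    chosen with probability proportional to that vertex's current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]          # start from a single edge between vertices 0 and 1
    endpoints = [0, 1]        # each vertex appears once per incident edge
    for new_vertex in range(2, num_vertices):
        target = rng.choice(endpoints)   # degree-proportional sampling
        edges.append((new_vertex, target))
        endpoints.extend([new_vertex, target])
    return edges

edges = preferential_attachment(10_000)
degrees = Counter(v for edge in edges for v in edge)
print("maximum degree:", max(degrees.values()))                       # a few hubs...
print("median degree:", sorted(degrees.values())[len(degrees) // 2])  # ...many low-degree vertices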
Network theory has a lot to say about natural graphs and the dynamics that shape them. This matters to graph practitioners because it helps ensure that the systems we build work well on the graphs we are targeting. We have already seen how important it is to be aware of, and work around, supernodes. Network theory helps us understand when and how supernodes arise. Similarly, many other topics within network theory can give us a better understanding of particular domain graphs.
Linked: The New Science of Networks by Albert-László Barabási (Perseus), the scientific father of preferential attachment theory, is a good popular-science introduction to graph thinking. If you are not afraid of a denser read, or of getting into the mathematics of it all, we recommend the survey paper “The Structure and Function of Complex Networks” by Mark Newman.1 This paper provides a high-level introduction to many areas within network science, with enough mathematical depth to be practical while remaining high-level enough to cover a lot of ground quickly. It also contains many references to more in-depth materials.
If you’ve enjoyed this introduction to practical graph thinking and want to learn more about it or join a group of like-minded individuals on a shared journey, we encourage you to:
Follow us on Twitter: @Graph_Thinking
Visit our GitHub: https://github.com/datastax/graph-book
The animal on the cover of The Practitioner’s Guide to Graph Data is the Mediterranean rainbow wrasse (Coris julis). This colorful fish inhabits the northeastern Atlantic from Sweden to Senegal and into the Mediterranean. It lives near the shoreline and favors rocky, grassy areas. It feeds on small crustaceans such as shrimp, on sea urchins, and on gastropods such as sea slugs. To eat this hard-shelled prey, the rainbow wrasse has evolved sharp teeth and a protractile jaw.
A sequential hermaphrodite, the rainbow wrasse changes in color and size over its lifespan. These fish may be born either male or female; in the primary phase they are brown with a white belly and a yellow-orange band running down either side of the body. Secondary-phase females reach lengths of up to about 7 inches, or they may change into secondary-phase males, which grow to about 10 inches. Secondary-phase males are much more colorful: green or blue, with a bright orange zigzag stripe along either side.
The population of the Mediterranean rainbow wrasse is stable and not threatened. Many of the animals on O’Reilly covers are endangered; all of them are important to the world.
The cover fonts are Gilroy Semibold and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.